HUMAN GENOME-DERIVED SINGLE EXON NUCLEIC ACID PROBES USEFUL
FOR ANALYSIS OF GENE EXPRESSION IN HUMAN ADULT LIVER 



CROSS REFERENCE TO RELATED APPLICATIONS 


The present application is a continuation-in-part of U.S.
patent application serial nos. 09/632,366, filed August 3,
2000 and 09/608,408, filed June 30, 2000; claims the
benefit under 35 U.S.C. § 119(e) of U.S. provisional patent
application serial nos. 60/236,359, filed September 27,
2000, 60/234,687, filed September 21, 2000, 60/207,456,
filed May 26, 2000, and 60/180,312, filed February 4, 2000;
and further claims the benefit under 35 U.S.C. § 119(a) of
UK patent application no. 0024263.6, filed October 4, 2000,
the disclosures of which are incorporated herein by
reference in their entireties.

REFERENCE TO SEQUENCE LISTING AND INCORPORATION BY 
REFERENCE THEREOF 


The present application includes a Sequence Listing in 
electronic format, filed pursuant to PCT Administrative 
Instructions 801 - 806 on a single CD-R disc, in 
triplicate, containing a file named pto_ADULT_LIVER.txt, 
created 24 January 2001, having 26,335,065 bytes. The
Sequence Listing contained in said file on said disc is 
incorporated herein by reference in its entirety. 

Field of the Invention 


The present invention relates to genome-derived 
single exon microarrays useful for verifying the expression 
of regions of genomic DNA predicted to encode protein. In 
particular, the present invention relates to unique genome-derived
single exon nucleic acid probes expressed in human



WO 01/57273 PCT/US01/00664 

adult liver and single exon nucleic acid microarrays that 
include such probes. 



Background of the Invention 

For almost two decades following the invention of
general techniques for nucleic acid sequencing, Sanger et
al., Proc. Natl. Acad. Sci. USA 70(4):1209-13 (1973);
Gilbert et al., Proc. Natl. Acad. Sci. USA 70(12):3581-4
(1973), these techniques were used principally as tools to
further the understanding of proteins, known or
suspected, about which a basic foundation of biological
knowledge had already been built. In many cases, the
cloning effort that preceded sequence identification had
been both informed and directed by that antecedent
biological understanding.

For example, the cloning of the T cell receptor
for antigen was predicated upon its known or suspected cell
type-specific expression, upon its suspected membrane
association, and upon the predicted assembly of its gene via
T cell-specific somatic recombination. Subsequent
sequencing efforts at once confirmed and extended
understanding of this family of proteins. Hedrick et al.,
Nature 308(5955):153-8 (1984).

More recently, however, the development of high
throughput sequencing methods and devices, in concert with
large public and private undertakings to sequence the human
and other genomes, has altered this investigational
paradigm: today, sequence information often precedes
understanding of the basic biology of the encoded protein
product.

One of the approaches to large-scale sequencing
is predicated upon the proposition that expressed
sequences (that is, those accessible through isolation of
mRNA) are of greatest initial interest. This "expressed
sequence tag" ("EST") approach has already yielded vast
amounts of sequence data (see for example Adams et al.,
Science 252:1651 (1991); Williamson, Drug Discov. Today
4:115 (1999)). For nucleic acids sequenced by this
approach, often the only biological information that is
known a priori with any certainty is the likelihood of
biologic expression itself. By virtue of the species and
tissue from which the mRNA had originally been obtained,
most such sequences are also annotated with the identity of
the species and at least one tissue in which expression
appears likely.

More recently, the pace of genomic sequencing has
accelerated dramatically. When genomic DNA serves as the
initial substrate for sequencing efforts, expression cannot
be presumed; often the only a priori biological information
about the sequence includes the species and chromosome (and
perhaps chromosomal map location) of origin.

With the ever-accelerating pace of sequence
accumulation by directed, EST, and genomic sequencing
approaches (and in particular, with the accumulation of
sequence information from multiple genera, from multiple
species within genera, and from multiple individuals within
a species) there is an increasing need for methods that
rapidly and effectively permit the functions of nucleic
acid sequences to be elucidated. And as such functional
information accumulates, there is a further need for
methods of storing such functional information in
meaningful and useful relationship to the sequence itself;
that is, there is an increasing need for means and
apparatus for annotating raw sequence data with known or
predicted functional information.

Although the increase in the pace of genomic
sequencing is due in large part to technological changes in
sequencing strategies and instrumentation, Service, Science
280:995 (1998); Pennisi, Science 283:1822-1823 (1999),
there is an important functional motivation as well.

While it was understood that the EST approach
would rarely be able to yield sequence information about
the noncoding portions of the genome, it now also appears
the EST approach is capable of capturing only a fraction of
a genome's actual expression complexity.

For example, when the C. elegans genome was fully
sequenced, gene prediction algorithms identified over
19,000 potential genes, of which only 7,000 had been found
by EST sequencing. C. elegans Sequencing Consortium,
Science 282:2012 (1998). Analogously, the recently
completed sequence of chromosome 2 of Arabidopsis predicts
over 4000 genes, Lin et al., Nature 402:761 (1999), of
which only about 6% had previously been identified via EST
sequencing efforts. Although the human genome has the
greatest depth of EST coverage, it is still woefully short
of surrendering all of its genes. One recent estimate
suggests that the human genome contains more than 146,000
genes, which would at this point leave greater than half of
the genes undiscovered. It is now predicted that many
genes, perhaps 20 to 50%, will only be found by genomic
sequencing.

There is, therefore, a need for methods that
permit the functional regions of genomic sequence (and
most importantly, but not exclusively, regions that
function to encode genes) to be identified.

Much of the coding sequence of the human genome
is not homologous to known genes, making detection of open
reading frames ("ORFs") and predictions of gene function
difficult. Computational methods exist for predicting
coding regions in eukaryotic genomes. Gene prediction
programs such as GRAIL and GRAIL II, Uberbacher et al.,
Proc. Natl. Acad. Sci. USA 88(24):11261-5 (1991); Xu et
al., Genet. Eng. 16:241-53 (1994); Uberbacher et al.,
Methods Enzymol. 266:259-81 (1996); GENEFINDER, Solovyev et
al., Nucl. Acids. Res. 22:5156-63 (1994); Solovyev et al.,
Ismb 5:294-302 (1997); and GENSCAN, Burge et al., J. Mol.
Biol. 268:78-94 (1997), predict many putative genes without
known homology or function. Such programs are known,
however, to give high false positive rates. Burset et al.,
Genomics 34:353-367 (1996). Using a consensus obtained by
a plurality of such programs is known to increase the
reliability of calling exons from genomic sequence.
Ansari-Lari et al., Genome Res. 8(1):29-40 (1998).
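
The consensus idea described above can be illustrated with a minimal
sketch. It assumes that each prediction program reports exons as
half-open (start, end) coordinate intervals on the same strand of the
same genomic sequence; the function name consensus_exons, the
min_votes parameter, and the coordinates in the usage note are
hypothetical and are not taken from any of the cited programs.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates


def consensus_exons(predictions: List[List[Interval]],
                    min_votes: int = 2) -> List[Interval]:
    """Return regions called as exon by at least `min_votes` programs.

    `predictions` holds one list of intervals per program (assumed format).
    """
    events = []
    for program_calls in predictions:
        # Merge overlapping calls within one program so that each
        # program contributes at most one vote per base.
        merged: List[Interval] = []
        for start, end in sorted(program_calls):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        for start, end in merged:
            events.append((start, 1))   # a vote begins
            events.append((end, -1))    # a vote ends
    events.sort()  # at equal positions, ends (-1) sort before starts (+1)

    result: List[Interval] = []
    depth = 0
    region_start = 0
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_votes <= depth:
            region_start = pos          # vote count rises to the threshold
        elif depth < min_votes <= previous:
            # Vote count falls below the threshold: close the region,
            # joining it to the previous one if the two are adjacent.
            if result and result[-1][1] == region_start:
                result[-1] = (result[-1][0], pos)
            elif pos > region_start:
                result.append((region_start, pos))
    return result
```

With three hypothetical call sets, [(100, 200), (400, 500)],
[(150, 250), (600, 700)] and [(120, 220), (450, 480)], a two-vote
threshold keeps only the regions supported by at least two programs,
and the call at (600, 700), made by a single program, is dropped.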

Identification of functional genes from genomic
data remains, however, an imperfect art. For example, in
reporting the full sequence of human chromosome 21, the
Chromosome 21 Mapping and Sequencing Consortium reports
that prior bioinformatic estimates of human gene number may
need to be revised substantially downwards. Nature
405:311-319 (2000); Reeves, Nature 405:283-284 (2000).

Thus, there is a need for methods and apparatus
that permit the functions of the regions identified
bioinformatically (and specifically, that permit the
expression of regions predicted to encode protein) readily
to be confirmed experimentally.

Recently, the development of nucleic acid
microarrays has made possible the automated and highly
parallel measurement of gene expression. Reviewed in
Schena (ed.), DNA Microarrays: A Practical Approach
(Practical Approach Series), Oxford University Press (1999)
(ISBN: 0199637768); Nature Genet. 21(1)(suppl):1-60
(1999); Schena (ed.), Microarray Biochip: Tools and
Technology, Eaton Publishing Company/BioTechniques Books
Division (2000) (ISBN: 1881299376).

It is common for microarrays to be derived from
cDNA/EST libraries, either from those previously described
in the literature, such as those from the I.M.A.G.E.
consortium, Lennon et al., Genomics 33(1):151-2 (1996), or
from the construction of "problem specific" libraries
targeted at a particular biological question, R.S. Thomas
et al., Cancer Res. (in press). Such microarrays by
definition can measure expression only of those genes found
in EST libraries, and thus have not been useful as probes
for genes discovered solely by genomic sequencing.
The utility of using whole genome nucleic acid
microarrays to answer certain biological questions has been
demonstrated for the yeast Saccharomyces cerevisiae. DeRisi
et al., Science 278:680 (1997). The vast majority of
yeast nuclear genes, approximately 95%, however, are single
exon genes, i.e., lack introns, Lopez et al., RNA 5:1135-1137
(1999); Goffeau et al., Science 274:563-67 (1996),
permitting coding regions more readily to be identified.
Whole genome nucleic acid microarrays have not generally
been used to probe gene expression from more complex
eukaryotic genomes, and in particular from those averaging
more than one intron per gene.

Diseases of the liver are a significant cause of
human morbidity and mortality. Increasingly, genetic
factors are being found that contribute to predisposition,
onset, and/or aggressiveness of most, if not all, of these
diseases; although causative mutations in single genes have
been identified for some, these disorders are believed for
the most part to have polygenic etiologies. There is a need
for methods and apparatus that permit prediction, diagnosis
and prognosis of diseases of the liver, particularly those
diseases with polygenic etiologies.

Summary of the Invention 

The present invention solves these and other
problems in the art by providing methods and apparatus for
predicting, confirming, and displaying functional
information derived from genomic sequence. The present
invention also provides apparatus for verifying the
expression of putative genes identified within genomic
sequence.

In particular, the invention provides novel
genome-derived single exon nucleic acid microarrays useful
for verifying the expression of putative genes identified
within genomic sequence.

The present invention also provides compositions
and kits for the ready production of nucleic acids
identical in sequence to, or substantially identical in
sequence to, probes on the genome-derived single exon
microarrays of the present invention.

Accordingly, in a first aspect of the invention,
there is provided a spatially-addressable set of single
exon nucleic acid probes for measuring gene expression in a
sample derived from human adult liver, comprising a
plurality of single exon nucleic acid probes according to
any one of the nucleotide sequences set out in SEQ ID NOs:
1 - 13,109 or a complementary sequence, or a portion of such
a sequence.

By plurality is meant at least two, suitably at
least 20, most suitably at least 100, preferably at least
1000 and, most preferably, up to 5000.

In one embodiment of the first aspect, each of
said plurality of probes is separately and addressably
amplifiable.

In an alternative embodiment, each of said
plurality of probes is separately and addressably
isolatable from said plurality.

In a preferred embodiment, each of said plurality
of probes is amplifiable using at least one common primer.
Preferably, each of said plurality of probes is amplifiable
using a first and a second common primer.

In yet another embodiment, said set of single
exon nucleic acid probes comprises between 50 - 20,000
probes, for example, 50 - 5000.

Suitably, said set of single exon nucleic acid
probes comprises at least 50 - 1000 discrete single exon
nucleic acid probes having a sequence as set out in any of
SEQ ID NOS.: 1 - 25,995 or a complementary sequence, or a
portion of such a sequence.
Preferably, the average length of the single exon
nucleic acid probes is between 200 and 500 bp. It is
preferred that the average length should be at least 200 bp,
suitably at least 250 bp, most suitably at least 300 bp,
preferably at least 400 bp and, most preferably, 500 bp.

In another embodiment, the single exon nucleic
acid probes lack prokaryotic and bacteriophage vector
sequence. It is preferred that at least 50%, suitably at
least 60%, most suitably at least 70%, preferably at least
75%, more preferably at least 80, 85, 90, 95 or 99% of said
single exon nucleic acid probes lack prokaryotic and
bacteriophage vector sequence.

In another preferred embodiment, said single exon
nucleic acid probes lack homopolymeric stretches of A or T.
It is preferred that at least 50%, suitably at least 60%, most
suitably at least 70%, preferably at least 75%, more
preferably at least 80, 85, 90, 95 or 99% of said single
exon nucleic acid probes lack homopolymeric stretches of A
or T.
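
By way of illustration only, the proportion of probes in a set that
lack homopolymeric stretches of A or T could be computed along the
following lines. The text does not state how long a run must be to
count as a homopolymeric stretch, so the min_run threshold of 10 used
here is purely an assumption, as are the function names.

```python
import re
from typing import Iterable


def lacks_homopolymer(seq: str, min_run: int = 10) -> bool:
    """True if `seq` has no run of A, or of T, at least `min_run` long.

    `min_run` is an assumed threshold; the source text gives no value.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_run, min_run), re.IGNORECASE)
    return pattern.search(seq) is None


def fraction_lacking_homopolymer(probes: Iterable[str], min_run: int = 10) -> float:
    """Fraction of probe sequences with no A or T homopolymeric stretch."""
    probes = list(probes)
    if not probes:
        return 0.0
    clean = sum(1 for p in probes if lacks_homopolymer(p, min_run))
    return clean / len(probes)
```

A set could then be compared against any of the recited percentages,
for example by checking that fraction_lacking_homopolymer(probes) is
at least 0.5.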

Preferably, a spatially-addressable set of single
exon nucleic acid probes in accordance with the first
aspect of the invention is addressably disposed upon a
substrate.

Suitable substrates include a filter membrane
which may, preferably, be nitrocellulose or nylon. The
nylon may, preferably, be positively-charged. Other suitable
substrates include glass, amorphous silicon, crystalline
silicon, and plastic. Further suitable materials include
polymethylacrylic, polyethylene, polypropylene,
polyacrylate, polymethylmethacrylate, polyvinylchloride,
polytetrafluoroethylene, polystyrene, polycarbonate,
polyacetal, polysulfone, cellulose acetate,
cellulose nitrate, nitrocellulose, and mixtures thereof.

In a second aspect of the invention, there is
provided a microarray comprising a spatially addressable
set of single exon nucleic acid probes in accordance with
the first aspect of the invention.

In one embodiment, a genome-derived single-exon
microarray is packaged together with such an ordered set of
amplifiable probes corresponding to the probes, or one or
more subsets of probes, thereon. In alternative
embodiments, the ordered set of amplifiable probes is
packaged separately from the genome-derived single exon
microarray.

In another aspect, the invention provides genome-derived
single exon nucleic acid probes useful for gene
expression analysis, and particularly for gene expression
analysis by microarray. In particular embodiments of this
aspect, the present invention provides human single-exon
probes that include specifically-hybridizable fragments of
SEQ ID Nos. 13,110 - 25,995, wherein the fragment
hybridizes at high stringency to an expressed human gene.
In particular embodiments, the invention provides single
exon probes comprising SEQ ID Nos. 1 - 13,109.

Accordingly, in a third aspect of the invention,
there is provided a single exon nucleic acid probe for
measuring human gene expression in a sample derived from
human adult liver which is a nucleic acid molecule
comprising a nucleotide sequence as set out in any of SEQ
ID NOs.: 1 - 13,109 or a complementary sequence or a
fragment thereof wherein said probe hybridizes at high
stringency to a nucleic acid expressed in the human adult
liver.

In one embodiment, a single exon nucleic acid
probe in accordance with the third aspect comprises a
nucleotide sequence as set out in any of SEQ ID NOs.:
13,110 - 25,995 or a complementary sequence or a fragment
thereof.

In a fourth aspect of the invention, there is
provided a single exon nucleic acid probe for measuring
human gene expression in a sample derived from human adult
liver which is a nucleic acid molecule having a sequence
encoding a peptide comprising a peptide sequence as set out
in any of SEQ ID NOs.: 25,996 - 38,578 or a complementary
sequence or a fragment thereof wherein said probe
hybridizes at high stringency to a nucleic acid expressed
in the human adult liver.

Preferably, a single exon nucleic acid probe in
accordance with the third or fourth aspects of the
invention comprises between at least 15 and 50 contiguous
nucleotides of said SEQ ID NO. It is preferred that the
single exon nucleic acid probe comprises at least 15,
suitably at least 20, more suitably at least 25 or
preferably at least 50 contiguous nucleotides of said SEQ
ID NO.

In another preferred embodiment, a single exon
nucleic acid probe in accordance with the third or fourth
aspects of the invention is between 3kb and 25kb in length.
It is preferred that said probe is no more than 3kb,
suitably no more than 5kb, more suitably no more than 10kb,
preferably 15kb, more preferably 20kb or, most preferably,
no more than 20kb in length.

Preferably, a single exon nucleic acid probe in
accordance with either the fifth or sixth aspect of the
invention is DNA, preferably single-stranded DNA, RNA or
PNA.

In another embodiment of either the third or
fourth aspect of the invention, a single exon nucleic acid
probe is detectably labeled. Suitable detectable labels
include a radionuclide, a fluorescent label or a first
member of a specific binding pair. Suitable fluorescent
labels include dyes such as cyanine dyes, preferably Cy3
and Cy5, although other suitable dyes will be known to those
skilled in the art.

In a particularly preferred embodiment, a single
exon nucleic acid probe in accordance with either the third
or fourth aspect of the invention lacks prokaryotic and
bacteriophage vector sequence. In yet another embodiment, a
single exon nucleic acid probe in accordance with either
the third or fourth aspect of the invention lacks
homopolymeric stretches of A or T.

In a fifth aspect of the invention, there is
provided an amplifiable nucleic acid composition,
comprising:

the single exon nucleic acid probe in accordance
with either of the third or fourth aspects of the
invention; and at least one nucleic acid primer;

wherein said at least one primer is sufficient to
prime enzymatic amplification of said probe.

In a sixth aspect of the invention, there is
provided a method of measuring gene expression in a sample
derived from human adult liver, comprising:

contacting the single exon microarray in
accordance with the second aspect of the invention, with a
first collection of detectably labeled nucleic acids, said
first collection of nucleic acids derived from mRNA of
human adult liver; and then

measuring the label detectably bound to each
probe of said microarray.

In a seventh aspect of the invention, there is
provided a method of identifying exons in a eukaryotic
genome, comprising:

algorithmically predicting at least one exon from
genomic sequence of said eukaryote; and then

detecting specific hybridization of detectably
labeled nucleic acids to a single exon probe,

wherein said detectably labeled nucleic acids are
derived from mRNA from the adult liver of said eukaryote,
said probe is a single exon probe having a fragment
identical in sequence to, or complementary in sequence to,
said predicted exon, said probe is included within a single
exon microarray in accordance with the first aspect of the
invention, and said fragment is selectively hybridizable at
high stringency.

In an eighth aspect of the invention, there is
provided a method of assigning exons to a single gene,
comprising:

identifying a plurality of exons from genomic
sequence in accordance with the seventh aspect of the
invention; and then

measuring the expression of each of said exons in
a plurality of tissues and/or cell types using
hybridization to single exon microarrays having a probe
with said exon,

wherein a common pattern of expression of said
exons in said plurality of tissues and/or cell types
indicates that the exons should be assigned to a single
gene.
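
One way to make "a common pattern of expression" concrete is sketched
below. The text does not say how similarity of expression patterns is
to be scored, so the use of Pearson correlation, the min_r threshold of
0.9, and the greedy grouping rule are all assumptions made for
illustration; the exon names and expression values in the usage note
are likewise made up.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no pattern to compare
    return cov / sqrt(var_x * var_y)


def group_exons(expression: Dict[str, Sequence[float]],
                min_r: float = 0.9) -> List[List[str]]:
    """Greedily group exons whose profiles all correlate at >= `min_r`.

    `expression` maps an exon identifier to its measured expression in
    each tissue or cell type, in the same tissue order for every exon.
    """
    groups: List[List[str]] = []
    for exon, profile in expression.items():
        for group in groups:
            # Join a group only if correlated with every current member.
            if all(pearson(profile, expression[m]) >= min_r for m in group):
                group.append(exon)
                break
        else:
            groups.append([exon])
    return groups
```

For instance, hypothetical profiles [1, 5, 2, 8] and [2, 10, 4, 16]
rise and fall together across four tissues and would be grouped as
candidates for a single gene, whereas [9, 1, 7, 2] would be placed in
a group of its own.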

In a ninth aspect of the invention, there is
provided a nucleic acid sequence as set out in any of SEQ
ID NOs: 1 - 25,995 wherein said sequence encodes a peptide.

In a tenth aspect of the invention, there is
provided a peptide encoded by a sequence comprising a
sequence as set out in any of SEQ ID NOs: 13,110 - 25,995,
or a complementary sequence or coding portion thereof.

In a preferred embodiment, a peptide may be
encoded by a sequence comprising a sequence set out in any
of SEQ ID NOS.: 1 - 13,109.

In a further aspect, the invention provides
peptides comprising an amino acid sequence translated from
the DNA fragments, said amino acid sequences comprising SEQ
ID NOS.: 25,996 - 38,578.

Accordingly, in an eleventh aspect of the invention
there is provided a peptide comprising a sequence as set
out in any of SEQ ID NOs: 25,996 - 38,578, or fragment
thereof.

In another aspect, the invention provides means
for displaying annotated sequence, and in particular, for
displaying sequence annotated according to the methods and
apparatus of the present invention. Further, such display
can be used as a preferred graphical user interface for
electronic search, query, and analysis of such annotated
sequence.



Detailed Description of the Invention
Definitions 

As used herein, the term "microarray" and phrase
"nucleic acid microarray" refer to a substrate-bound
collection of plural nucleic acids, hybridization to each
of the plurality of bound nucleic acids being separately
detectable. The substrate can be solid or porous, planar
or non-planar, unitary or distributed.

As so defined, the term "microarray" and phrase
"nucleic acid microarray" include all the devices so called
in Schena (ed.), DNA Microarrays: A Practical Approach
(Practical Approach Series), Oxford University Press (1999)
(ISBN: 0199637768); Nature Genet. 21(1)(suppl):1-60
(1999); and Schena (ed.), Microarray Biochip: Tools and
Technology, Eaton Publishing Company/BioTechniques Books
Division (2000) (ISBN: 1881299376). As so defined, the
term "microarray" and phrase "nucleic acid microarray"
further include substrate-bound collections of plural
nucleic acids in which the nucleic acids are distributably
disposed on a plurality of beads, rather than on a unitary
planar substrate, as is described, inter alia, in Brenner
et al., Proc. Natl. Acad. Sci. USA 97(4):1665-1670 (2000);
in such case, the term "microarray" and phrase "nucleic
acid microarray" refer to the plurality of beads in
aggregate.

As used herein with respect to a nucleic acid
microarray, the term "probe" refers to the nucleic acid
that is, or is intended to be, bound to the substrate; in
such context, the term "target" thus refers to nucleic acid
intended to be bound thereto by Watson-Crick
complementarity. As used herein with respect to solution
phase hybridization, the term "probe" refers to the nucleic
acid of known sequence that is detectably labeled.

As used herein, the expression "probe comprising
SEQ ID NO.", and variants thereof, intends a nucleic acid
probe, at least a portion of which probe has either (i) the
sequence directly as given in the referenced SEQ ID NO., or
(ii) a sequence complementary to the sequence as given in
the referenced SEQ ID NO., the choice as between sequence
directly as given and complement thereof dictated by the
requirement that the probe hybridize to mRNA.
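
The two alternatives in this definition can be illustrated with a
short sketch. It assumes that "complementary" is read in the usual
sense of the reverse complement written 5' to 3', and that sequences
are plain DNA strings over A, C, G and T; the function names are
hypothetical.

```python
# Translation table for Watson-Crick base pairing (DNA alphabet only).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def probe_comprises(probe: str, listed: str) -> bool:
    """True if a portion of `probe` has the listed sequence either
    (i) directly as given or (ii) as its complement."""
    probe_upper = probe.upper()
    listed_upper = listed.upper()
    return (listed_upper in probe_upper
            or reverse_complement(listed_upper) in probe_upper)
```

Which of the two orientations is appropriate for a given probe is, as
the definition states, dictated by the requirement that the probe
hybridize to mRNA; the sketch only tests membership and does not
choose between them.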

As used herein, the term "open reading frame" and
the equivalent acronym "ORF" refer to that portion of an
exon that can be translated in its entirety into a sequence
of contiguous amino acids, i.e., a nucleic acid sequence
that, in at least one reading frame, does not possess stop
codons; the term does not require that the ORF encode the
entirety of a natural protein.
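
As a minimal sketch of this definition, the following checks which
reading frames of a sequence are free of stop codons. It assumes the
standard genetic code stop codons (TAA, TAG, TGA), DNA rather than RNA
input, and examination of the given strand only; the function name is
hypothetical.

```python
from typing import List

# Stop codons of the standard genetic code, written as DNA (assumption).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(seq: str) -> List[int]:
    """Return the forward reading frames (0, 1, 2) of `seq` in which
    no complete codon is a stop codon."""
    seq = seq.upper()
    frames = []
    for frame in range(3):
        # Step through complete codons only; a trailing partial codon is ignored.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames
```

Under the definition above, a sequence qualifies as an ORF if
open_frames returns at least one frame; nothing in the check requires
a start codon or a complete protein.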

As used herein, the term "amplicon" refers to a
PCR product amplified from human genomic DNA, containing
the predicted exon.

As used herein, the term "exon" refers to the
consensus prediction of the various exon and gene
predicting algorithms, i.e., a nucleic acid sequence
bioinformatically predicted to encode a portion of a
natural protein.

As used herein, the term "peptide" refers to a
sequence of amino acids. The sequences referred to as
PEPTIDE SEQ ID NOS. are the predicted peptide sequences
that would be translated from one of the exons, or a
portion thereof, set out in exon SEQ ID NOS. The codons
encoding the peptide are wholly contained within the exon.

As used herein, "portions" of a defined
nucleotide sequence or sequences can be and, preferably,
are fragments unique to that sequence or to one or a
combination of those sequences. A fragment unique to a
nucleic acid molecule is one that is a signature for the
larger nucleic acid molecule.

As used herein, the phrase "expression of a
probe" and its linguistic variants means that the ORF
present within the probe, or its complement, is present
within a target mRNA.

As used herein, "stringent conditions" refers to
parameters well known to those skilled in the art. When a
nucleic acid molecule is said to be hybridisable to another
of a given sequence under "stringent conditions" it is
meant that it is homologous to the given sequence.

As used herein, the phrase "specific binding
pair" intends a pair of molecules that bind to one another
with high specificity. Binding pairs are said to exhibit
specific binding when they exhibit avidity of at least 10^7,
preferably at least 10^8, more preferably at least 10^9
liters/mole. Nonlimiting examples of specific binding
pairs are: antibody and antigen; biotin and avidin; and
biotin and streptavidin.

As used herein with respect to the visual display
of annotated genomic sequence, the term "rectangle" means
any geometric shape that has at least a first and a second
border, wherein the first and second borders each are
capable of mapping uniquely to a point of another visual
object of the display.

As used herein, a "Mondrian" means a visual
display in which a single genomic sequence is annotated
with predicted and experimentally confirmed functional
information.



Brief Description of the Drawings 

The present invention is further illustrated with
reference to the following non-limiting figures and
examples in which:

FIG. 1 illustrates a process for predicting
functional regions from genomic sequence, confirming the
functional activity of such regions experimentally, and
associating and displaying the data so obtained in
meaningful and useful relationship to the original sequence
data;

FIG. 2 further elaborates that portion of the
process schematized in FIG. 1 for predicting functional
regions from genomic sequence;

FIG. 3 illustrates a Mondrian visual display; 

FIG. 4 presents a Mondrian showing a hypothetical 
annotated genomic sequence; 
FIG. 5 is a histogram showing the distribution of
ORF length and PCR products as obtained, with ORF length
shown in black and PCR product length shown in dotted
lines;

FIG. 6 is a histogram showing the distribution,
among exons predicted according to the methods described,
of expression as measured using simultaneous two color
hybridization to a genome-derived single exon microarray.
The graph shows the number of sequence-verified products
that were either not expressed ("0"), expressed in one or
more but not all tested tissues ("1" - "9"), or expressed
in all tissues tested ("10");

FIG. 7 is a pictorial representation of the
expression of verified sequences that showed expression
with signal intensity greater than 3 in at least one
tissue, with: FIG. 7A showing the expression as measured by
microarray hybridization in each of the 10 measured
tissues, and the expression as measured "bioinformatically"
by query of EST, NR and SwissProt databases; with FIG. 7B
showing the legend for display of physical expression
(ratio) in FIG. 7A; and with FIG. 7C showing the legend for
scoring EST hits as depicted in FIG. 7A;

FIG. 8 shows a comparison of normalized Cy3
signal intensity for arrayed sequences that were identical
to sequences in existing EST, NR and SwissProt databases or
that were dissimilar (unknown), where black denotes the
signal intensity for all sequence-verified products with a
BLAST Expect ("E") value of greater than 1e-30 (1 x 10^-30)
("unknown") and a dotted line denotes sequence-verified
spots with a BLAST Expect ("E") value of less than 1e-30
(1 x 10^-30) ("known");

FIG. 9 presents a Mondrian of BAC AC008172 (bases
25,000 to 130,000), containing the carbamyl phosphate
synthetase gene (AF154830.1); and

FIG. 10 is a Mondrian of BAC A049839. 


Methods and Apparatus for Predicting, Confirming, 
Annotating, and Displaying Functional Regions From Genomic 
Sequence Data 


FIG. 1 is a flow chart illustrating in broad
outline a process for predicting functional regions from
genomic sequence, confirming and characterizing the
functional activity of such regions experimentally, and
then associating and displaying the information so obtained
in meaningful and useful relationship to the original
sequence data.

The initial input into process 10 of the present
invention is drawn from one or more databases 100
containing genomic sequence data. Because genomic sequence
is usually obtained from subgenomic fragments, the sequence
data typically will be stored in a series of records
corresponding to these subgenomic sequenced fragments.
Some fragments will have been catenated to form larger
contiguous sequences ("contigs"); others will not. A
finite percentage of sequence data in the database will
typically be erroneous, consisting inter alia of vector
sequence, sequence created from aberrant cloning events,
sequence of artificial polylinkers, and sequence that was
erroneously read.

Each sequence record in database 100 will
minimally contain as annotation a unique sequence
identifier (accession number), and will typically be
annotated further to identify the date of accession,
species of origin, and depositor. Because database 100 can
contain nongenomic sequence, each sequence will typically
be annotated further to permit query for genomic sequence.
Chromosomal origin, optionally with map location, can also
be present. Data can be, and over time increasingly will
be, further annotated with additional information, in part
through use of the present invention, as described below.
Annotation can be present within the data records, in
information external to database 100 and linked to the
records thereto, or through a combination of the two.

Databases useful as genomic sequence database 100
in the present invention include GenBank, and particularly
include several divisions thereof, including the
htgs (draft), NT (nucleotide, command line), and NR
(nonredundant) divisions. GenBank is produced by the
National Institutes of Health and is maintained by the
National Center for Biotechnology Information (NCBI).
Databases of genomic sequence from species other than
human, such as mouse, rat, Arabidopsis, C. elegans, C.
briggsae, Drosophila, zebrafish, and other higher
eukaryotic organisms will also prove useful as genomic
sequence database 100.

Genomic sequence obtained by query of genomic
sequence database 100 is then input into one or more
processes 200 for identification of regions therein that
are predicted to have a biological function as specified by
the user. Such functions include, but are not limited to,
encoding protein, regulating transcription, regulating
message transport after transcription into mRNA, regulating
message splicing after transcription into mRNA, or
regulating message degradation after transcription into
mRNA, and the like. Other functions include directing
somatic recombination events, contributing to chromosomal
stability or movement, contributing to allelic exclusion or
X chromosome inactivation, and the like.

The particular genomic sequence to be input into
process 200 will depend upon the function for which
relevant sequence is to be identified as well as upon the
approach chosen for such identification. Process step 200
can be iterated to identify different functions within a
given genomic region. In such case, the input often will
be different for the several iterations.

Sequences predicted to have the requisite
function by process 200 are then input into process 300,
where a subset of the input sequences suitable for
experimental confirmation is identified. Experimental
confirmation can involve physical and/or bioinformatic
assay. Where the subsequent experimental assay is
bioinformatic, rather than physical, there are fewer
constraints on the sequences that can be tested, and in
this latter case therefore process 300 can output the
entirety of the input sequence.

The subset of sequences output from process 300
is then used in process 400 for experimental verification
and characterization of the function predicted in
process 200, which experimental verification can, and often
will, include both physical and bioinformatic assay.

Process 500 annotates the sequence data with the
functional information obtained in the physical and/or
bioinformatic assays of process 400. Such annotation can
be done using any technique that usefully relates the
functional information to the sequence, as, for example, by
incorporating the functional data into the sequence data
record itself, by linking records in a hierarchical or
relational database, by linking to external databases, by a
combination thereof, or by other means well known within
the database arts. The data can even be submitted for
incorporation into databases maintained by others, such as
GenBank, which is maintained by NCBI.

As further noted in FIG. 1, additional annotation
can be input into process 500 from external sources 600.

The annotated data is then displayed in process
800, either before, concomitantly with, or after optional
storage 700 on nontransient media, such as magnetic disk,
optical disc, magnetooptical disk, flash memory, or the
like.

FIG. 1 shows that the experimental data output
from process 400 can be used in each preceding step of
process 10: e.g., facilitating identification of functional
sequences in process 200, facilitating identification of an
experimentally suitable subset thereof in process 300, and
facilitating creation of physical and/or informational
substrates for, and performance of subsequent assay of,
functional sequences in process 400.

Information from each step can be passed directly
to the succeeding process, or stored in permanent or
interim form prior to passage to the succeeding process.
Often, data will be stored after each, or at least a
plurality, of such process steps. Any or all process steps
can be automated.

FIG. 2 further elaborates the prediction of
functional sequence within genomic sequence according to
process 200.

Genomic sequence database 100 is first queried 20
for genomic sequence.

The sequence required to be returned by query 20
will depend, in the first instance, upon the function to be
identified.

For example, genomic sequences that function to
encode protein can be identified inter alia using gene
prediction approaches, comparative sequence analysis
approaches, or combinations of the two. In gene prediction
analysis, sequence from one genome is input into process
200 where at least one, preferably a plurality, of
algorithmic methods are applied to identify putative coding
regions. In comparative sequence analysis, by contrast,
corresponding, e.g., syntenic, sequence from a plurality of
sources, typically a plurality of species, is input into
process 200, where at least one, possibly a plurality, of
algorithmic methods are applied to compare the sequences
and identify regions of least variability.

The exact content of query 20 will also depend
upon the database queried. For example, if the database
contains both genomic and nongenomic sequence, perhaps
derived from multiple species, and the function to be
determined is protein coding regions in human genomic
sequence, the query will accordingly require that the
sequence returned be genomic and derived from humans.

Query 20 can also incorporate criteria that
compel return of sequence that meets operative requirements
of the subsequent analytical method. Alternatively, or in
addition, such operative criteria can be enforced in
subsequent preprocess step 24.

For example, if the function sought to be
identified is protein coding, query 20 can incorporate
criteria that return from genomic sequence database 100
only those sequences present within contigs sufficiently
long as to have obviated substantial fragmentation of any
given exon among a plurality of separate sequence
fragments.

Such criteria can, for example, consist of a
required minimal individual genomic sequence fragment
length, such as 10 kb, more typically 20 kb, 30 kb, 40 kb,
and preferably 50 kb or more, as well as an optional
further or alternative requirement that sequence from any
given clone, such as a bacterial artificial chromosome
("BAC"), be presented in no more than a finite maximal
number of fragments, such as no more than 20 separate
pieces, more typically no more than 15 fragments, even more
typically no more than about 10 - 12 fragments.

Results using the present invention have shown
that genomic sequence from bacterial artificial chromosomes
(BACs) is sufficient for gene prediction analysis according
to the present invention if the sequence is at least 50 kb
in length, and if additionally the sequence from any given
BAC is presented in fewer than 15, and preferably fewer
than 10, fragments. Accordingly, query 20 can incorporate
a requirement that data accessioned from BAC sequencing be
in fewer than 15, preferably fewer than 10, fragments.
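
By way of illustration only, the length and fragment-count criteria of
query 20 might be sketched as follows. The record layout (`Fragment`,
`clone_id`, `length_bp`) and the function name `select_clones` are
hypothetical and do not reflect any actual database schema; the sketch
further assumes that the 50 kb requirement is applied to the total
sequence accessioned for a clone, which is one possible reading of the
text above.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    clone_id: str   # hypothetical identifier of the source BAC
    length_bp: int  # length of this sequenced fragment in base pairs


def select_clones(fragments, min_total_bp=50_000, max_fragments=15):
    """Return clone ids whose sequence passes both query criteria.

    Assumption: a clone passes if its fragments total at least
    `min_total_bp` and it is presented in fewer than `max_fragments`
    separate fragments.
    """
    by_clone = {}
    for frag in fragments:
        by_clone.setdefault(frag.clone_id, []).append(frag)

    selected = []
    for clone_id, frags in by_clone.items():
        total = sum(f.length_bp for f in frags)
        if total >= min_total_bp and len(frags) < max_fragments:
            selected.append(clone_id)
    return sorted(selected)
```

Setting `max_fragments=10` would apply the more stringent, preferred
fragment limit.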

An additional criterion that can be incorporated
into the query can be the date, or range of dates, of
sequence accession. Although the process has been
described above as if genomic sequence database 100 were
static, it is of course understood that the genomic
sequence databases need not be static, and indeed are
typically updated on a frequent, even hourly, basis. Thus,
as further described in Examples 1 and 2, infra, it is
possible to query the database for newly added sequence,
either newly added after an absolute date, or newly added
relative to a prior analysis performed using the methods
and apparatus of the present invention. In this way, the
process herein described can incorporate a dynamic,
temporal component.
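
A temporal criterion of this kind can be sketched, under stated
assumptions, as a simple date filter. The record keys (`accession`,
`accession_date`) and the example accession identifiers are
hypothetical placeholders, not actual database entries.

```python
from datetime import date


def newly_accessioned(records, since):
    """Return records whose accession date is strictly later than `since`.

    `since` can be an absolute cutoff date or the date of a prior
    analysis, giving the relative form of the query.
    """
    return [rec for rec in records if rec["accession_date"] > since]


# Hypothetical records for illustration only.
records = [
    {"accession": "REC0001", "accession_date": date(2000, 1, 10)},
    {"accession": "REC0002", "accession_date": date(2000, 2, 20)},
]
last_analysis = date(2000, 2, 1)
new_records = newly_accessioned(records, last_analysis)
```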

One utility of such temporal limitation is to
identify, from newly accessioned genomic sequence, the
presence of novel genes, particularly those not previously
identified by EST sequencing (or other sequencing efforts
that are similarly based upon gene expression). As further
described in Example 1, such an approach has shown that
newly accessioned human genomic sequence, when analyzed for
sequences that function to encode protein, readily
identifies genes that are novel over those in existing EST
and other expression databases. This makes the methods of
the present invention extremely powerful gene discovery
tools. And as would be appreciated, such gene discovery
can be performed using genomic sequence from species other
than human.

If query 20 incorporates multiple criteria, such
as above-described, the multiple criteria can be performed
as a series of separate queries or as a single query,
depending in part upon the query language, the complexity
of the query, and other considerations well known in the
database arts.

If query 20 returns no genomic sequence meeting
the query criteria, the negative result can be reported by
process 22, and process 200 (and indeed, entire process 10)
ended 23, as shown. Alternatively, or in addition to
report and termination of the initial inquiry, a new query
20 can be generated that takes into account the initial
negative result.

When query 20 returns sequence meeting the query
criteria, the returned sequence is then passed to optional
preprocessing 24, suitable and specific for the desired
analytical approach and the particular analytical methods
thereof to be used in process 25.

Preprocessing 24 can include processes suitable
for many approaches and methods thereof, as well as
processes specifically suited for the intended subsequent
analysis.

Preprocessing 24 suitable for most approaches and
methods will include elimination of sequence irrelevant to,
or that would interfere with, the subsequent analysis.
Such sequence includes repetitive sequence, such as Alu
repeats and LINE elements, vector sequence, artificial
sequence, such as artificial polylinkers, and the like.
Such removal can readily be performed by identification and
subsequent masking of the undesired sequence.

Identification can be effected by comparing the
genomic sequence returned by query 20 with public or
private databases containing known repetitive sequence,
vector sequence, artificial sequence, and other artifactual
sequence. Such comparison can readily be done using
programs well known in the art, such as CROSS_MATCH, or by
proprietary sequence comparison programs the engineering of
which is well within the skill in the art.

Alternatively, or in addition, undesirable,
including artifactual, sequence can be identified
algorithmically without comparison to external databases
and thereafter removed. For example, synthetic polylinker
sequence can be identified by an algorithm that identifies
a significantly higher than average density of known
restriction sites. As another example, vector sequence can
be identified by algorithms that identify nucleotide or
codon usage at variance with that of the bulk of the
genomic sequence.
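
As a non-limiting sketch of the restriction-site density approach, the
following counts occurrences of a small panel of well-known
six-base recognition sites within a sliding window and flags windows
in which the count is unusually high. The window size, the threshold
`min_sites`, the choice of enzymes, and the function names are
assumptions made for illustration; a practical implementation would
calibrate the threshold against the average site density of the
genomic sequence under analysis.

```python
# Recognition sites of common restriction enzymes (EcoRI, BamHI,
# HindIII, PstI, SalI, XbaI, KpnI, SacI, SmaI).
SITES = (
    "GAATTC", "GGATCC", "AAGCTT", "CTGCAG", "GTCGAC",
    "TCTAGA", "GGTACC", "GAGCTC", "CCCGGG",
)


def site_positions(seq):
    """Return sorted start positions of every recognition site in seq."""
    seq = seq.upper()
    hits = []
    for site in SITES:
        start = seq.find(site)
        while start != -1:
            hits.append(start)
            start = seq.find(site, start + 1)
    return sorted(hits)


def polylinker_windows(seq, window=60, min_sites=4):
    """Return start positions of windows holding >= min_sites site starts.

    Assumption: `min_sites` stands in for "significantly higher than
    average density"; it is an illustrative value, not a derived one.
    """
    hits = site_positions(seq)
    flagged = []
    for i in range(0, max(1, len(seq) - window + 1)):
        count = sum(1 for h in hits if i <= h < i + window)
        if count >= min_sites:
            flagged.append(i)
    return flagged
```

Flagged windows could then be merged into intervals and masked in the
same manner as sequence identified by database comparison.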

Once identified, undesired sequence can be
removed. Removal can usefully be done by masking the
undesired sequence as, for example, by converting the
specific nucleotide references to one that is unrecognized
by the subsequent bioinformatic algorithms, such as "X".
Alternatively, but at present less preferred, the undesired
sequence can be excised from the returned genomic sequence,
leaving gaps.
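
Masking of this kind can be sketched as follows. The function name
and the use of half-open (start, end) coordinates are illustrative
assumptions; masking, unlike excision, preserves the length of the
sequence so that coordinates of later exon calls remain valid.

```python
def mask_intervals(seq, intervals, mask_char="X"):
    """Replace the bases in each half-open interval with mask_char.

    Sequence length is preserved, so downstream coordinates still
    refer to the original genomic sequence.
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(0, start), min(len(chars), end)):
            chars[i] = mask_char
    return "".join(chars)
```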

Preprocessing 24 can further include selection
from among duplicative sequences of that one sequence of
highest quality. Higher quality can be measured as a lower
percentage of, fewest number of, or least densely clustered
occurrence of ambiguous nucleotides, defined as those
nucleotides that are identified in the genomic sequence
using symbols indicating ambiguity. Higher quality can
also or alternatively be valued by presence in the longest
contig.
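
One of the quality measures just described, the percentage of
ambiguous nucleotides, can be sketched as below. The function names
and the tie-breaking rule (prefer the longer sequence) are assumptions
for illustration; any symbol other than A, C, G or T is treated here
as ambiguous.

```python
UNAMBIGUOUS = set("ACGT")


def ambiguity_fraction(seq):
    """Fraction of positions not called as A, C, G or T."""
    if not seq:
        return 1.0
    ambiguous = sum(1 for base in seq.upper() if base not in UNAMBIGUOUS)
    return ambiguous / len(seq)


def best_duplicate(candidates):
    """Pick the id of the duplicate with the lowest ambiguity fraction.

    `candidates` maps a sequence id to its sequence. Ties are broken
    in favor of the longer sequence (an illustrative choice).
    """
    return min(
        candidates,
        key=lambda sid: (ambiguity_fraction(candidates[sid]),
                         -len(candidates[sid])),
    )
```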

Preprocessing 24 can, and often will, also
include formatting of the data as specifically appropriate
for passage to the analytical algorithms of process 25.
Such formatting can and typically will include, inter alia,
addition of a unique sequence identifier, either derived
from the original accession number in genomic sequence
database 100, or newly applied, and can further include
additional annotation. Formatting can include conversion
from one to another sequence listing standard, such as
conversion to or from FASTA or the like, depending upon the
input expected by the subsequent process.
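
A minimal sketch of conversion to FASTA, with the unique sequence
identifier placed in the header line, might read as follows. The
function name, the 60-character line width, and the identifier
`seq0001` used in the usage line are hypothetical.

```python
def to_fasta(seq_id, seq, description="", width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    header = f">{seq_id} {description}".rstrip()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


record = to_fasta("seq0001", "ACGT" * 20, description="masked")
```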

Preprocessing, which can be optional depending
upon the function desired to be identified and the
informational requirements of the methods for effecting
such identification, is followed by sequence processing 25,
where sequences with the desired function are identified
within the genomic sequence.

As mentioned above, such functions can include,
but are not limited to, encoding protein, regulating
transcription, regulating message transport after
transcription into mRNA, regulating message splicing after
transcription, or regulating message degradation, and the
like. Other functions include directing somatic
recombination events, contributing to chromosomal stability
or movement, contributing to allelic exclusion or X
chromosome inactivation, or the like.

The methods of the present invention are
particularly useful for gene discovery, that is, for
identifying, from genomic sequence, regions that function
to encode genes, and in a particularly useful embodiment,
for identifying regions that function to encode genes not
hitherto identified by expression-based or directed cloning
and sequencing. In conjunction with verification using the
novel single exon microarrays of the present invention, as
further described below, the methods herein described
become powerful gene discovery tools.

Accordingly, in a preferred embodiment of the
present invention, process 25 is used to identify putative
coding regions. Two preferred approaches in process 25 for
identifying sequence that encodes putative genes are gene
prediction and comparative sequence analysis.

Gene prediction can be performed using any of a
number of algorithmic methods, embodied in one or more
software programs, that identify open reading frames (ORFs)
using a variety of heuristics, such as GRAIL, DICTION, and
GENEFINDER. Comparative sequence analysis similarly can be
performed using any of a variety of known programs that
identify regions with lower sequence variability.

As further described in Example 1, below, gene
finding software programs yield a range of results. For
the newly accessioned human genomic sequence input in
Example 1, for example, GRAIL identified the greatest
percentage of genomic sequence as putative coding region,
2% of the data analyzed; GENEFINDER was second, calling 1%;
and DICTION yielded the least putative coding region, with
0.8% of genomic sequence called as coding region.

Increased reliability can be obtained when
consensus is required among several such methods. Although
discussed herein particularly with respect to exon calling,
consensus among methods will in general increase
reliability of predicting other functions as well.

Thus, as indicated by query 26, sequence
processing 25, optionally with preprocessing 24, can be
repeated with a different method, with consensus among such
iterations determined and reported in process 27.

Process 27 compares the several outputs for a
given input genomic sequence and identifies consensus among
the separately reported results. The consensus itself, as
well as the sequence meeting that consensus, is then stored
in process 29a, displayed in process 29b, and/or output to
process 300 for subsequent identification of a subset
thereof suitable for assay.

Multiple levels of consensus can be calculated
and reported by process 27. For example, as further
described in Example 1, infra, process 27 can report
consensus as between all specific pairs of methods of gene
prediction, as consensus among any one or more of the pairs
of methods of gene prediction, or as among all of the gene
prediction algorithms used. Thus, in Example 1, process 27
reported that GRAIL and GENEFINDER programs agreed on 0.7%
of genomic sequence, that GRAIL and DICTION agreed on 0.5%
of genomic sequence, and that the three programs together
agreed on 0.25% of the data analyzed. Put another way,
0.25% of the genomic sequence was identified by all three
of the programs as containing putative coding region.
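
The multi-level consensus report of process 27 can be illustrated,
under stated assumptions, by treating each program's output as a list
of half-open (start, end) intervals on the input sequence and
computing the fraction of bases called by each program, by each pair
of programs, and by all programs. The function names and the generic
program labels used in any example are hypothetical; the sketch uses
per-base sets for clarity rather than efficiency, and says nothing
about how the programs themselves report their calls.

```python
from itertools import combinations


def covered(intervals):
    """Return the set of base positions covered by half-open intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def consensus_report(calls, genome_length):
    """Fraction of the sequence called by each program, pair, and all.

    `calls` maps a program name to its list of intervals and must
    contain at least one program. Keys of the returned dict are tuples
    of program names in sorted order.
    """
    cov = {name: covered(ivs) for name, ivs in calls.items()}
    report = {}
    for name, bases in cov.items():
        report[(name,)] = len(bases) / genome_length
    for a, b in combinations(sorted(cov), 2):
        report[(a, b)] = len(cov[a] & cov[b]) / genome_length
    everyone = set.intersection(*cov.values())
    report[tuple(sorted(cov))] = len(everyone) / genome_length
    return report
```

Multiplying each reported fraction by 100 gives percentages of the
kind quoted above for Example 1.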

Furthermore, consensus can be required among 
different approaches to identifying a chosen function. 

For example, if the function desired to be
identified is coding of protein sequence, and a first used
approach to exon calling is gene prediction, the process
can be repeated on the same input sequence, or subset
thereof, with another approach, such as comparative
sequence analysis. In such a case, where comparative
sequence analysis follows gene prediction, the comparison
can be performed not only on genomic nucleic acid sequence,
but additionally or alternatively can be performed on the
predicted amino acid sequence translated from the ORFs
previously identified by the gene prediction approach.

Although shown as an iterative process, the
multiple analyses required to achieve consensus can be done
in series, in parallel, or some combination thereof.

Predicted functional sequence, optionally
representing a consensus among a plurality of methods and
approaches for determination thereof, is passed to process
300 for identification of a subset thereof for functional
assay.

In the preferred embodiment of the methods of the
present invention, wherein the function sought to be
identified is protein coding, process 300 is used to
identify a subset thereof suitable for experimental
verification by physical and/or bioinformatic approaches.

For example, putative ORFs identified in process
200 can be classified, or binned, bioinformatically into
putative genes. This binning can be based inter alia upon
consideration of the average number of exons/gene in the
species chosen for analysis, upon density of exons that
have been called on the genomic sequence, and other
empirical rules. Thereafter, one or more among the
gene-specific ORFs can be chosen for subsequent use in gene
expression assay.
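
The binning rules themselves are not specified above. Purely as an
assumption-laden sketch, one simple empirical rule would group
neighboring ORFs into the same putative gene while the gap between
them stays below a chosen distance and the bin has not exceeded a
chosen exon count. The parameters `max_gap` and `max_exons_per_gene`
and their default values are hypothetical and would need to be set
from species-specific data.

```python
def bin_orfs(orfs, max_gap=20_000, max_exons_per_gene=10):
    """Group (start, end) ORF calls into putative genes.

    Hypothetical rule: an ORF joins the current bin when it begins
    within `max_gap` bases of the end of the previous ORF and the bin
    holds fewer than `max_exons_per_gene` ORFs.
    """
    bins = []
    for start, end in sorted(orfs):
        if (bins
                and start - bins[-1][-1][1] <= max_gap
                and len(bins[-1]) < max_exons_per_gene):
            bins[-1].append((start, end))
        else:
            bins.append([(start, end)])
    return bins
```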

Where such subsequent gene expression assay uses
amplified nucleic acid, considerations such as desired
amplicon length, primer synthesis requirements, putative
exon length, sequence GC content, existence of possible
secondary structure, and the like can be used to identify
and select those ORFs that appear most likely to amplify
successfully. Where subsequent gene expression assay relies
upon nucleic acid hybridization, whether or not using
amplified product, further considerations involving
hybridization stringency can be applied to identify that
subset of sequences that will most readily permit
sequence-specific discrimination at a chosen hybridization
and wash stringency. One particular such consideration is
avoidance of putative exons that span repetitive sequence;
such sequence can hybridize spuriously to nonspecific
message, reducing specific signal in the hybridization.
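
A few of these selection considerations can be sketched as a simple
filter. The thresholds (`min_len`, `gc_range`) are illustrative
assumptions rather than values taught herein, and the sketch assumes
that repetitive sequence has already been masked with "X" during
preprocessing, so that a putative exon spanning a repeat can be
recognized by the presence of the mask character.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def is_candidate(seq, min_len=100, gc_range=(0.35, 0.65), mask_char="X"):
    """Apply illustrative length, GC-content and repeat-mask checks."""
    if len(seq) < min_len:
        return False
    if mask_char in seq.upper():
        return False  # spans masked (e.g., repetitive) sequence
    low, high = gc_range
    return low <= gc_content(seq) <= high
```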

For bioinformatic assay, there are fewer
constraints on the sequences that can be tested
experimentally, and in this latter case therefore process
300 can output the entirety of the input sequence.

The subset of sequences identified by process 300
as suitable for use in assay is then used in process 400 to
create the physical and/or informational substrate for
experimental verification of the predictions made in
process 200, and thereafter to assay those substrates.

As mentioned, the methods of the present
invention are particularly useful for identifying potential
coding regions within genomic sequence. In a preferred
embodiment of process 400, therefore, the expression of the
sequences predicted to encode protein is verified. The
combination of the predictive and experimental methods
provides a powerful gene discovery engine.

Thus, in another aspect, the present invention
provides methods and apparatus for verifying the expression
of putative genes identified within genomic sequence. In
particular, the invention provides a novel method of
verifying gene expression in which expression of predicted
ORFs is measured and confirmed using a novel type of
nucleic acid microarray, the genome-derived single exon
nucleic acid microarrays of the present invention.

Putative ORFs as predicted by a consensus of gene
calling, particularly gene prediction, algorithms in
process 200, and as further identified as suitable by
process 300, are amplified from genomic DNA using the
polymerase chain reaction (PCR). Although PCR is
conveniently used, other amplification approaches can also
be used.

Amplification schemes can be designed to capture
the entirety of each predicted ORF in an amplicon with
minimal additional (that is, intronic or intergenic)
sequence. Because ORFs predicted from human genomic
sequence using the methods of the present invention differ
in length, such an approach results in amplicons of varying
length.

However, most predicted ORFs are shorter than 500
bp in length, and although amplicons of at least about 100
or 200 base pairs can be immobilized as probes on nucleic
acid microarrays, early experimental results using the
methods of the present invention have suggested that longer
amplicons, at least about 400 or 500 base pairs, are more
effective. Furthermore, certain advantages derive from
application to the microarray of amplicons of defined size.

Therefore, amplification schemes can
alternatively, and preferably, be designed to amplify
regions of defined size, preferably at least about 300, 400
or 500 bp, centered about each predicted ORF. Such an
approach results in a population of amplicons of limited
size diversity, but that typically contain intronic and/or
intergenic nucleic acid in addition to putative ORF.

Conversely, somewhat fewer than 10% of ORFs
predicted from human genomic sequence according to the
methods of the present invention exceed 500 bp in length.
Portions of such extended ORFs, preferably at least about
300, 400 or 500 bp in length, can be amplified. However, it
has been discovered that the percentage success at
amplifying pieces of such ORFs is low, and that such
putative exons are more effectively amplified when larger
fragments, at least about 1000 or 1500 bp, and even as
large as 2000 bp are amplified.

The putative ORFs selected in process 300 are
thus input into one or more primer design programs, such as
PRIMER3 (available online for use at
http://www-genome.wi.mit.edu/cgi-bin/primer/ ), with a goal
of amplifying at least about 500 base pairs of genomic
sequence centered within or about ORFs predicted to be no
more than about 500 bp, or at least about 1000 - 1500 bp of
genomic sequence for ORFs predicted to exceed 500 bp in
length. Primers with the requisite sequences can be
purchased commercially or synthesized by standard
techniques.
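
The choice of target region handed to a primer design program can be
sketched as below. This is not primer design itself, which is left to
programs such as PRIMER3; the function name, the use of half-open
coordinates, the 1000 bp default for longer ORFs, and the clamping of
the window to the contig ends are all illustrative assumptions.

```python
def amplicon_window(orf_start, orf_end, contig_length,
                    short_target=500, long_target=1000, orf_cutoff=500):
    """Return a (start, end) target region centered on a predicted ORF.

    ORFs of at most `orf_cutoff` bp get a `short_target` window; longer
    ORFs get a `long_target` window. The window is shifted, and if
    necessary truncated, to stay within the contig.
    """
    orf_length = orf_end - orf_start
    target = short_target if orf_length <= orf_cutoff else long_target
    center = (orf_start + orf_end) // 2
    start = center - target // 2
    start = max(0, min(start, contig_length - target))
    end = min(contig_length, start + target)
    return start, end
```

For example, a 200 bp ORF at positions 1000 to 1200 of a 10 kb contig
would yield the window (850, 1350), within which primer pairs flanking
the ORF would then be sought.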

Conveniently, a first predetermined sequence can
be added commonly to the ORF-specific 5' primer and a
second, typically different, predetermined sequence
commonly added to each 3' ORF-unique primer. This serves
to immortalize the amplicon, that is, serves to permit
further amplification of any amplicon using a single set of
primers complementary respectively to the common 5' and
common 3' sequence elements. The presence of these
"universal" priming sequences further facilitates later
sequence verification, providing a sequence common to all
amplicons at which to prime sequencing reactions. The
common 5' and 3' sequences further serve to add a cloning
site should any of the ORFs warrant further study.

Such predetermined sequence is usefully at least
about 10, 12 or 15 nt in length, and usually does not
exceed about 25 nt in length. The "universal" priming
sequences used in the examples presented infra were each 16
nt long.
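
Appending the common sequences to the ORF-specific primers can be
sketched as follows. All sequences are written 5' to 3', so each
predetermined sequence is prepended to its primer. The two 16 nt tail
sequences and the example primers are arbitrary placeholders chosen
for illustration only; they are not the sequences used in the
examples presented infra.

```python
# Arbitrary 16 nt placeholder tails (hypothetical, for illustration).
TAIL_5 = "TGACCATGATTACGCC"
TAIL_3 = "CAGGAAACAGCTATGA"


def add_universal_tails(forward_primer, reverse_primer,
                        tail_5=TAIL_5, tail_3=TAIL_3):
    """Prepend the common tails to a pair of ORF-specific primers.

    Tails outside the approximate 10 to 25 nt range described in the
    text are rejected.
    """
    for tail in (tail_5, tail_3):
        if not 10 <= len(tail) <= 25:
            raise ValueError("universal tail should be about 10 to 25 nt")
    return tail_5 + forward_primer, tail_3 + reverse_primer


# Hypothetical ORF-specific primers.
tailed_fwd, tailed_rev = add_universal_tails(
    "ACGTACGTACGTACGTACGT", "TTGGCCAATTGGCCAATTGG")
```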

The genomic DNA to be used as substrate for
amplification will come from the eukaryotic species from
which the genomic sequence data had originally been
obtained, or a closely related species, and can
conveniently be prepared by well known techniques from
somatic or germline tissue or cultured cells of the
organism. See, e.g., Short Protocols in Molecular Biology:
A Compendium of Methods from Current Protocols in
Molecular Biology, Ausubel et al. (eds.), 4th edition
(April 1999), John Wiley & Sons (ISBN: 047132938X) and
Maniatis et al., Molecular Cloning: A Laboratory Manual,
2nd edition (December 1989), Cold Spring Harbor Laboratory
Press (ISBN: 0879693096). Many such prepared genomic DNAs
are available commercially, with the human genomic DNAs
additionally having certification of donor informed
consent.

Although the intronic and intergenic material
flanking putative coding regions in the amplicons could
potentially interfere with hybridizations during microarray
experiments, we have found, surprisingly, that differential
expression ratios are not significantly affected. Rather,
the predominant effect of exon size is to alter the
absolute signal intensity, rather than its ratio. Equally
surprising, the art had suggested that single exon probes
would not provide sufficient signal intensity for high
stringency hybridization analyses; we find that such probes
not only provide adequate signal, but have substantial
advantages, as herein described.

After partial purification, as by size exclusion
spin column, with or without confirmation as to amplicon
quality as by gel electrophoresis, each amplicon (single
exon probe) is disposed in an array upon a support
substrate.

Methods for creating microarrays by deposition
and fixation of nucleic acids onto support substrates are
well known in the art (reviewed by Schena et al., see
above).

Typically, the support substrate will be glass,
although other materials, such as amorphous or crystalline
silicon or plastics, can also be used. Such plastics
include polymethylacrylic, polyethylene, polypropylene,
polyacrylate, polymethylmethacrylate, polyvinylchloride,
polytetrafluoroethylene, polystyrene, polycarbonate,
polyacetal, polysulfone, celluloseacetate,
cellulosenitrate, nitrocellulose, or mixtures thereof.
Typically, the support will be rectangular,
although other shapes, particularly circular disks and even
spheres, present certain advantages. Particularly
advantageous alternatives to glass slides as support
substrates for array of nucleic acids are optical discs, as
described in WO 98/12559.

The amplified nucleic acids can be attached
covalently to a surface of the support substrate or, more
typically, applied to a derivatized surface in a chaotropic
agent that facilitates denaturation and adherence by
presumed noncovalent interactions, or some combination
thereof.

Robotic spotting devices useful for arraying
nucleic acids on support substrates can be constructed
using public domain specifications (The MGuide, version
2.0, http://cmgm.stanford.edu/pbrown/mguide/index.html), or
can conveniently be purchased from commercial sources
(MicroArray GenII Spotter and MicroArray GenIII Spotter,
Molecular Dynamics, Inc., Sunnyvale, CA). Spotting can
also be effected by printing methods, including those using
ink jet technology.

As is well known in the art, microarrays
typically also contain immobilized control nucleic acids.
For controls useful in providing measurements of background
signal for the genome-derived single exon microarrays of
the present invention, a plurality of E. coli genes can
readily be used. As further described in Example 1, 16 or
32 E. coli genes suffice to provide a robust measure of
background noise in such microarrays.
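
How the control signals are converted into a background measure is
not detailed in this passage. One common convention, offered here only
as an assumption, is to call a probe detectably expressed when its
signal exceeds the mean of the control signals by a chosen number of
standard deviations. The function names, the default of three
standard deviations, and the example values in the usage note are
hypothetical.

```python
from statistics import mean, stdev


def background_threshold(control_signals, n_sd=3.0):
    """Mean of the control signals plus n_sd sample standard deviations.

    Requires at least two control values.
    """
    return mean(control_signals) + n_sd * stdev(control_signals)


def expressed_probes(probe_signals, control_signals, n_sd=3.0):
    """Return sorted probe ids whose signal exceeds the threshold.

    `probe_signals` maps a probe id to its measured signal intensity.
    """
    threshold = background_threshold(control_signals, n_sd)
    return sorted(pid for pid, signal in probe_signals.items()
                  if signal > threshold)
```

With control signals of 100, 110, 90, 105 and 95, for instance, the
threshold falls near 124, so that a probe reading 500 would be called
expressed and a probe reading 110 would not.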

As is well known in the art, the amplified
product disposed in arrays on a support substrate to create
a nucleic acid microarray can consist entirely of natural
nucleotides linked by phosphodiester bonds, or
alternatively can include either nonnative nucleotides,
alternative internucleotide linkages, or both, so long as
complementary binding can be obtained in the hybridization.
If enzymatic amplification is used to produce the
immobilized probes, the amplifying enzyme will impose
certain further constraints upon the types of nucleic acid
analogs that can be generated.

Although particularly described herein as using
high density microarrays constructed on planar substrates,
the methods of the present invention for confirming the
expression of ORFs predicted from genomic sequence can use
any of the known types of microarrays, as herein defined,
including lower density planar arrays, and microarrays on
nonplanar, nonunitary, distributed substrates.

For example, gene expression can be confirmed
using hybridization to lower density arrays, such as those
constructed on membranes, such as nitrocellulose, nylon,
and positively-charged derivatized nylon membranes.
Further, gene expression can also be confirmed using
nonplanar, bead-based microarrays such as are described in
Brenner et al., Proc. Natl. Acad. Sci. USA 97(4):1665-1670
(2000); U.S. Patent No. 6,057,107; and U.S. Patent No.
5,736,330. In theory, a packed collection of such beads
provides in aggregate a higher density of nucleic acid
probe than can be achieved with spotting or lithography
techniques on a single planar substrate.

Planar microarrays on solid substrates, however,
provide certain useful advantages, including high
throughput and compatibility with existing readers. For
example, each standard microscope slide can include at
least 1000, typically at least 2000, preferably 5000 and
up to 10,000 - 50,000 or more nucleic acid probes of
discrete sequence. The number of sequences deposited will
depend on their required application.

Each putative gene can be represented in the
array by a single predicted ORF. Alternatively, genes can
be represented by more than one predicted ORF. For
purposes of measuring differential splicing, more than one
predicted ORF will be provided for a putative gene. And as
is well known in the art, each probe of defined sequence,
representing a single predicted ORF, can be deposited in a
plurality of locations on a single microarray to provide
redundancy of signal.

The genome-derived single exon microarrays
described above differ in several fundamental and
advantageous ways from microarrays presently used in the
gene expression art, including (1) those created by
deposition of mRNA-derived nucleic acids, (2) those created
by in situ synthesis of oligonucleotide probes, and (3)
those constructed from yeast genomic DNA.

Most nucleic acid microarrays that are in use for
study of eukaryotic gene expression have as immobilized
probes nucleic acids that are derived - either directly or
indirectly - from expressed message. As discussed above,
it is common, for example, for such microarrays to be
derived from cDNA/EST libraries, either from those
previously described in the literature, see Lennon et al.,
or from the de novo construction of "problem specific"
libraries targeted at a particular biological question,
R.S. Thomas et al., Cancer Res. (in press). Such
microarrays are herein collectively denominated "EST
microarrays".

Such EST microarrays by definition can measure
expression only of those genes found in EST libraries,
shown herein to represent only a fraction of expressed
genes. Furthermore, such libraries - and thus microarrays
based thereupon - are biased by the tissue or cell type of
message origin, by the expression levels of the respective
genes within the tissues, and by the ability of the message
successfully to have been reverse-transcribed and cloned.

Thus, as further discussed in Example 1, the
methods of the present invention enable sequences that do
not appear in EST or other expression databases to be
identified and subsequently arrayed for expression
measurement; such sequences could not, therefore, have been
represented as probes on an EST microarray. And as further
demonstrated in the examples, infra, the remaining
population of genes identified from genomic sequence by the
methods of the present invention - that is, the one third
of sequences that had previously been accessioned in EST or
other expression databases - are biased toward genes with
higher expression levels.

Representation of a message in an EST and/or cDNA
library depends upon the successful reverse transcription,
optionally but typically with subsequent successful
cloning, of the message. This introduces substantial bias
into the population of probes available for arraying in EST
microarrays.

In contrast, neither reverse transcription nor
cloning is required to produce the probes arrayed on the
genome-derived single exon microarrays of the present
invention. And although the ultimate deposition of a probe
on the genome-derived single exon microarray of the present
invention depends upon a successful amplification from
genomic material, a priori knowledge of the sequence of the
desired amplicon affords greater opportunity to recover any
given probe sequence recalcitrant to amplification than is
afforded by the requirement for successful reverse
transcription and cloning of unknown message in EST
approaches.

Thus, the genome-derived single exon microarrays
of the present invention present a far greater diversity of
probes for measuring gene expression, with far less bias,
than do EST microarrays presently used in the art.

As a further consequence of their ultimate origin
from expressed message, the probes in EST microarrays often
contain poly-A (or complementary poly-T) stretches derived
from the poly-A tail of mature mRNA. These homopolymeric
stretches contribute to cross-hybridization, that is, to a
spurious signal occasioned by hybridization to the
homopolymeric tail of a labeled cDNA that lacks sequence
homology to the gene-specific portion of the probe.

In contrast, the probes arrayed in the genome-derived
single exon microarrays of the present invention
lack homopolymeric stretches derived from message
polyadenylation, and thus can provide more specific signal.
Typically, at least about 50, 60 or 75% of the probes on
the genome-derived single exon microarrays of the present
invention lack homopolymeric regions consisting of A or T,
where a homopolymeric region is defined for purposes herein
as stretches of 25 or more, typically 30 or more, identical
nucleotides.
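The definition above lends itself to a simple check. The following
is a minimal sketch, assuming probe sequences are available as plain
strings; the function names and the default threshold argument are
hypothetical choices for illustration, with the 25-nucleotide cutoff
taken from the definition in the text.

```python
import re


def has_homopolymer(seq, min_run=25, bases="AT"):
    """True if seq contains a run of min_run or more identical
    nucleotides drawn from `bases` (A or T by default)."""
    # One captured base followed by at least (min_run - 1) repeats of it.
    pattern = r"([%s])\1{%d,}" % (re.escape(bases), min_run - 1)
    return re.search(pattern, seq.upper()) is not None


def fraction_lacking_homopolymer(probes, min_run=25):
    """Fraction of probes with no homopolymeric A or T region."""
    probes = list(probes)
    if not probes:
        raise ValueError("no probes supplied")
    clean = sum(1 for p in probes if not has_homopolymer(p, min_run))
    return clean / len(probes)
```

Passing min_run=30 applies the stricter of the two cutoffs named in
the definition.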

A further distinction, which also affects the
specificity of hybridization, is occasioned by the typical
derivation of EST microarray probes from cloned material.
Because much of the material disposed as probes on
EST microarrays is excised or amplified from plasmid,
phage, or phagemid vectors, EST microarrays typically
include a fair amount of vector sequence, more so when the
probes are amplified, rather than excised, from the vector.

In contrast, the vast majority of probes in the
genome-derived single exon microarrays of the present
invention contain no prokaryotic or bacteriophage vector
sequence, having been amplified directly or indirectly from
genomic DNA. Typically, therefore, at least about 50, 60,
70 or 80% or more of individual exon-including probes
disposed on a genome-derived single exon microarray of the
present invention lack vector sequence, and particularly
lack sequences drawn from plasmids and bacteriophage.
Preferably, at least about 85, 90 or more than 90% of
exon-including probes in the genome-derived single exon
microarray of the present invention lack vector sequence.
With attention to removal of vector sequences through
preprocessing 24, percentages of vector-free exon-including
probes can be as high as 95 - 99%. The substantial absence
of vector sequence from the genome-derived single exon
microarrays of the present invention results in greater
specificity during hybridization, since spurious
cross-hybridization to a probe vector sequence is reduced.

As a further consequence of excision or
amplification of probes from vectors in construction of EST
microarrays, the probes arrayed thereon often contain
artificial sequence, derived from vector polylinker
multiple cloning sites, at both 5' and 3' ends. The probes
disposed upon the genome-derived single exon microarrays
need have no such artificial sequence appended thereto.

As mentioned above, however, the ORF-specific
primers used to amplify putative ORFs can include
artificial sequences, typically 5' to the ORF-specific
primer sequence, useful for "universal" (that is,
independent of ORF sequence) priming of subsequent
amplification or sequencing reactions. When such
"universal" 5' and/or 3' priming sequences are appended to
the amplification primers, the probes disposed upon the
genome-derived single exon microarray will include
artificial sequence similar to that found in EST
microarrays. However, the genome-derived single exon
microarray of the present invention can be made without
such sequences, and if so constructed, presents an even
smaller amount of nonspecific sequence that would
contribute to nonspecific hybridization.

Yet another consequence of typical use of cloned
material as probes in EST microarrays is that such
microarrays contain probes that result from cloning
artifacts, such as chimeric molecules containing coding
region of two separate genes. Derived from genomic
material, typically not thereafter cloned, the probes of
the genome-derived single exon microarrays of the present
invention lack such cloning artifacts, and thus provide
greater specificity of signal in gene expression
measurements.

A further consequence of the cloned origin of
probes on many EST microarrays is that the individual
probes often have disparate sizes, which can cause the
optimal hybridization stringency to vary among probes on a
single microarray. In contrast, as discussed above, the
probes arrayed on the genome-derived single exon
microarrays of the present invention can readily be
designed to have a narrow distribution in sizes, with the
range of probe sizes no greater than about 10% of the
average size, typically no greater than about 5% of the
average probe size.
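The size criterion stated above can be checked arithmetically. The
sketch below assumes that "range" means the difference between the
longest and shortest probe, expressed as a fraction of the mean
probe length; that reading, and the function names, are assumptions
made for illustration.

```python
def size_spread(lengths):
    """(max - min) / mean for a collection of probe lengths in bp."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no probe lengths supplied")
    mean = sum(lengths) / len(lengths)
    return (max(lengths) - min(lengths)) / mean


def within_size_tolerance(lengths, max_fraction=0.10):
    """True if the size range is no greater than max_fraction of the mean."""
    return size_spread(lengths) <= max_fraction
```

For probes of 480, 500 and 520 bp the range is 40 bp against a mean
of 500 bp, a spread of 8%, which meets the 10% criterion but not the
5% one.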

Because of their origin from fully- or partially-spliced
message, probes disposed upon EST arrays will often
include multiple exons. The percentage of such
exon-spanning probes in an EST microarray can be calculated, on
average, based upon the predicted number of exons/gene for
the given species and the average length of the immobilized
probes. For human genes, the near-complete sequence of
human chromosome 22, Dunham et al., Nature 402(6761):489-95
(1999), predicts that human genes average 5.5 exons/gene.
Even with probes of 200 - 500 bp, the vast majority of
human EST microarray probes include more than one exon.

In contrast, by virtue of their origin from
algorithmically identified ORFs in genomic sequence, the
probes in the genome-derived single exon microarrays of the
present invention can consist of individual exons. Thus,
in contrast to EST microarrays, at least about 50, 60, 70,
75, 80, 85, 95 or 99% of probes deposited in the
genome-derived microarray of the present invention consist of, or
include, no more than one predicted ORF.

This provides the ability, not readily achieved
using EST microarrays, to use the genome-derived single
exon microarrays of the present invention to measure
tissue-specific expression of individual exons, which in
turn allows differential splicing events to be detected and
characterized, and in particular, allows the correlation of
differential splicing to tissue-specific expression
patterns.

Furthermore, the exons that are represented in
EST microarrays are often biased toward the 3' or 5' end of
their respective genes, since sequencing strategies used
for EST identification are so biased. In contrast, no such
3' or 5' bias necessarily inheres in the selection of exons
for disposition on the genome-derived single exon
microarrays of the present invention.

Conversely, the probes provided on the genome-derived
single exon microarrays of the present invention
typically include, but need not necessarily include,
intronic and/or intergenic sequence that is absent from EST
microarrays, which are derived from mature mRNA.
Typically, at least about 50, 60, 70, 80 or 90% of the
exon-including probes on the genome-derived single exon
microarrays of the present invention include sequence drawn
from noncoding regions. As discussed above, the additional
presence of noncoding region does not significantly
interfere with measurement of gene expression, and provides
the additional opportunity to assay prespliced RNA, and
thus to measure phenomena such as nuclear export control.

The genome-derived single exon microarrays of the 
present invention are also quite different from in situ 
synthesis microarrays, where probe size is severely 
constrained by inadequacies in the photolithographic
synthesis process. 

Typically, probes arrayed on in situ synthesis
microarrays are limited to a maximum of about 25 bp. As a
well known consequence, hybridization to such chips must be
performed at low stringency. In order, therefore, to
achieve unambiguous sequence-specific hybridization
results, the in situ synthesis microarray requires
substantial redundancy, with concomitant programmed
arraying for each probe of probe analogues with altered
(i.e., mismatched) sequence.

In contrast, the longer probe length of the
genome-derived single exon microarrays of the present
invention allows much higher stringency hybridization and
wash. Typically, therefore, exon-including probes on the
genome-derived single exon microarrays of the present
invention average at least about 100, 200, 300, 400 or
500 bp in length. By obviating the need for substantial
probe redundancy, this approach permits a higher density of
probes for discrete exons or genes to be arrayed on the
microarrays of the present invention than can be achieved
for in situ synthesis microarrays.

A further distinction is that the probes in in
situ synthesis microarrays typically are covalently linked
to the substrate surface. In contrast, the probes disposed
on the genome-derived microarray of the present invention
typically are, but need not necessarily be, bound
noncovalently to the substrate.

Furthermore, the short probe size on in situ
microarrays causes large percentage differences in the
melting temperature of probes hybridized to their
complementary target sequence, and thus causes large
percentage differences in the theoretically optimum
stringency across the array as a whole.

In contrast, the larger probe size in the
microarrays of the present invention creates lower
percentage differences in melting temperature across the
range of arrayed probes.

A further significant advantage of the 
microarrays of the present invention over in situ 
synthesized arrays is that the quality of each individual
probe can be confirmed before deposition. In contrast, the 
quality of probes cannot be assessed on a probe-by-probe 
basis for the in situ synthesized microarrays presently 
being used. 

The genome-derived single exon microarrays of the
present invention are also distinguished over, and present
substantial benefits over, the genome-derived microarrays
from lower eukaryotes such as yeast. Lashkari et al.,
Proc. Natl. Acad. Sci. USA 94:13057-13062 (1997).

Only about 220 - 250 of the 6100 or so nuclear
genes in Saccharomyces cerevisiae - that is, only about 4
- 5% - have standard, spliceosomal, introns, Lopez et al.,
Nucl. Acids Res. 28:85-86 (2000); Spingola et al., RNA
5(2):221-34 (1999). Furthermore, the entire yeast genome
has already been sequenced. These two facts permit the
ready amplification and disposition of single-ORF amplicons
on such microarray without the requirement for antecedent
use of gene prediction and/or comparative sequence
analyses.

Thus, a significant aspect of the present
invention is the ability to identify and to confirm
expression of predicted coding regions in genomic sequence
drawn from eukaryotic organisms that have a higher
percentage of genes having introns than do yeast such as
Saccharomyces cerevisiae, particularly in genomic sequence
drawn from eukaryotes in which at least about 10, 20 or 50%
of protein-encoding genes have introns. In preferred
embodiments, the methods and apparatus of the present
invention are used to identify and confirm expression of
novel genes from genomic sequence of eukaryotes in which
the average number of introns per gene is at least about
one, two or three or more.

After the physical substrate is prepared, 
experimental verification of predicted function is 
performed.

In a preferred embodiment of the present
invention, where the function sought to be identified in
genomic sequence is protein coding, experimental
verification is performed by measuring expression of the
putative ORFs, typically through nucleic acid hybridization
experiments, and in particularly preferred embodiments,
through hybridization to genome-derived single exon
microarrays prepared as above-described.

Expression is conveniently measured and expressed
for each probe in the microarray as a ratio of the
expression measured concurrently in a plurality of mRNA
sources, according to techniques well known in the
microarray art, reviewed in Schena et al., and as further
described in Example 2, below. The mRNA source for the
reference against which specific expression is measured can
be drawn from a homogeneous mRNA source, such as a single
cultured cell-type, or alternatively can be heterogeneous,
as from a pool of mRNA derived from multiple tissues and/or
cell types, as further described in Example 2, infra.
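The ratio described above can be sketched as follows. This is an
illustrative calculation only: the use of a base-2 logarithm, the
signal floor applied to avoid division by zero, and the function
and parameter names are assumptions for this sketch, and are not
drawn from Example 2.

```python
import math


def log2_expression_ratios(index_signal, reference_signal, floor=1.0):
    """Per-probe log2(index / reference) for two-source hybridization.

    index_signal, reference_signal: dicts mapping probe id -> intensity.
    floor: assumed minimum intensity, applied to both channels.
    """
    ratios = {}
    for probe_id, idx in index_signal.items():
        ref = reference_signal[probe_id]
        ratios[probe_id] = math.log2(max(idx, floor) / max(ref, floor))
    return ratios
```

A positive value indicates higher signal in the index source than in
the reference; a negative value indicates the reverse.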

mRNA can be prepared by standard techniques, see
Ausubel et al. and Maniatis et al., or purchased
commercially. The mRNA is then typically reverse-transcribed
in the presence of labeled nucleotides: the
index source (that in which expression is desired to be
measured) is reverse transcribed in the presence of
nucleotides labeled with a first label, typically a
fluorophore (fluorochrome; fluor; fluorescent dye); the
reference source is reverse transcribed in the presence of
nucleotides labeled with a second label, typically a
fluorophore, typically fluorometrically distinguishable
from the first label. As
further described in Example 2, infra, Cy3 and Cy5 dyes
prove particularly useful in these methods. After partial
purification of the index and reference targets,
hybridization to the probe array is conducted according to
standard techniques, typically under a coverslip.

After wash, microarrays are conveniently scanned
using a commercial microarray scanning device, such as a
Gen3 Scanner (Molecular Dynamics, Sunnyvale, CA). Data on
expression is then passed, with or without interim storage,
to process 500, where the results for each probe are
related to the original sequence.

Often, hybridization of target material to the
genome-derived single exon microarray will identify certain
of the probes thereon as of particular interest. Thus, it
is often desirable that the user be able readily to obtain
sufficient quantities of an individual probe, either for
subsequent arrayed deposition upon an additional support
substrate, often as part of a microarray having a plurality
of probes so identified, or alternatively or additionally
as a solitary solid-phase or solution-phase probe, for
further use.

Thus, in another aspect, the present invention 
provides compositions and kits for the ready production of 
nucleic acids identical in sequence to, or substantially
identical in sequence to, probes on the genome-derived
single exon microarrays of the present invention. 

In this aspect, a small quantity of each probe is
disposed, typically without attachment to substrate, in a
spatially-addressable ordered set, typically one per well
of a microtiter dish. Although a 96 well microtiter plate
can be used, greater efficiency is obtained using higher
density arrays, such as are provided by microtiter plates
having 384, 864, 1536, 3456, 6144, or 9600 wells. And
although microtiter plates having physical depressions
(wells) are conveniently used, any device that permits
addressable withdrawal of reagent from
fluidly-noncommunicating areas can be used.
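A spatially-addressable ordered set implies a fixed mapping between
a probe's position in the set and a well. The sketch below assumes
the conventional 8 x 12 and 16 x 24 layouts for 96 and 384 well
plates, row-major filling, and letter-plus-number well labels; the
filling order and the function name are assumptions for
illustration, and the higher density formats named above are not
covered.

```python
import string

# Assumed (rows, columns) for the two plate formats handled here.
PLATE_LAYOUTS = {96: (8, 12), 384: (16, 24)}


def well_address(index, wells=384):
    """Map a zero-based probe index to a well label such as 'A1'."""
    rows, cols = PLATE_LAYOUTS[wells]
    if not 0 <= index < wells:
        raise IndexError("probe index outside plate")
    row, col = divmod(index, cols)  # row-major fill: A1, A2, ..., then B1
    return f"{string.ascii_uppercase[row]}{col + 1}"
```

With such a mapping, a probe flagged on the microarray can be
retrieved from the plate by its index alone.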

In this aspect of the invention, therefore, a
fluidly noncommunicating addressable ordered set of
individual probes, corresponding to those on a
genome-derived single exon microarray, is provided, with each
probe in sufficient quantity to permit amplification, such
as by PCR. As earlier mentioned, the ORF-specific
5' primers used for genomic amplification can have a first
common sequence added thereto, and the ORF-specific 3'
primers used for genomic amplification can have a second,
different, common sequence added thereto, thus permitting,
in this preferred embodiment, the use of a single set of 5'
and 3' primers to amplify any one of the probes from the
amplifiable ordered set.

Each discrete amplifiable probe can also be 
packaged with amplification primers, solutes, buffers, 
etc., and can be provided in dry (e.g., lyophilized) form 
or wet, in the latter case typically with addition of
agents that retard evaporation.

In another aspect of the present invention, a 
genome-derived single-exon microarray is packaged together 
with such an ordered set of amplifiable probes 
corresponding to the probes, or one or more subsets of
probes, thereon. In alternative embodiments, the ordered
set of amplifiable probes is packaged separately from the 
genome-derived single exon microarray. 

In some embodiments, the microarray and/or
ordered probe set are further packaged with recordable
media that provide probe identification and addressing
information, and that can additionally contain annotation
information, such as gene expression data. Such recordable
media can be packaged with the microarray, with the ordered
probe set, or with both.

If the microarray is constructed on a substrate
that incorporates recordable media, such as is described in
international patent application no. WO 98/12559, then
separate packaging of the genome-derived single exon
microarray and the bioinformatic information is not
required.

The amount of amplifiable probe material should 
be sufficient to permit at least one amplification 
sufficient for subsequent hybridization assay. 

Although the use of high density genome-derived
microarrays on solid planar substrates is presently a
preferred approach for the physical confirmation and 
characterization of the expression of sequences predicted 
to encode protein, other types of microarrays (as herein 
defined) can also be used. 

Furthermore, as earlier mentioned, experimental
verification of the function predicted from genomic
sequence in process 200 can be bioinformatic, rather than,
or additional to, physical verification.

For example, where the function desired to be
identified is protein coding, the predicted ORFs can be
compared bioinformatically to sequences known or suspected
of being expressed.

Thus, the sequences output from process 300 (or
process 200) can be used to query expression databases,
such as EST databases, SNP ("single nucleotide
polymorphism") databases, known cDNA and mRNA sequences,
SAGE ("serial analysis of gene expression") databases, and
more generalized sequence databases that allow query for
expressed sequences. Such query can be done by any
sequence query algorithm, such as BLAST ("basic local
alignment search tool"). The results of such query -
including information on identical sequences and
information on nonidentical sequences that have diffuse or
focal regions of sequence homology to the query sequence -
can then be passed directly to process 500, or used to
inform analyses subsequently undertaken in process 200,
process 300, or process 400.

Experimental data, whether obtained by physical
or bioinformatic assay in process 400, is passed to process
500 where it is usefully related to the sequence data
itself, a process colloquially termed "annotation". Such
annotation can be done using any technique that usefully
relates the functional information to the sequence, as, for
example, by incorporating the functional data into the
record itself, by linking records in a hierarchical or
relational database, by linking to external databases, or
by a combination thereof. Such database techniques are
well within the skill in the art.
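By way of illustration, linking records in a relational database
might be sketched as below. The schema, table names, column names,
and example values are all hypothetical and chosen only for this
sketch; an in-memory SQLite database stands in for whatever
database is actually used.

```python
import sqlite3


def build_annotation_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sequence (seq_id TEXT PRIMARY KEY, seq_length INTEGER);
        CREATE TABLE annotation (
            seq_id     TEXT REFERENCES sequence(seq_id),
            feat_start INTEGER,
            feat_stop  INTEGER,
            kind       TEXT,   -- e.g. 'predicted_orf' or 'expression'
            detail     TEXT
        );
        """
    )
    return conn


def annotate(conn, seq_id, feat_start, feat_stop, kind, detail):
    """Relate one piece of functional data to a region of a sequence."""
    conn.execute(
        "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
        (seq_id, feat_start, feat_stop, kind, detail),
    )


def annotations_for(conn, seq_id):
    """All annotations for a sequence, ordered by start coordinate."""
    rows = conn.execute(
        "SELECT feat_start, feat_stop, kind, detail FROM annotation "
        "WHERE seq_id = ? ORDER BY feat_start",
        (seq_id,),
    )
    return rows.fetchall()
```

Keeping annotations in their own table lets physical and
bioinformatic results accumulate against a sequence without
rewriting the sequence record itself.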

The annotated sequence data can be stored
locally, uploaded to genomic sequence database 100, and/or
displayed 800.

The methods and apparatus of the present 
invention rapidly produce functional information from 
genomic sequence. Coupled with the escalating pace at
which sequence now accumulates, the rapid pace of sequence
annotation produces a need for methods of displaying the 
information in meaningful ways. 

FIG. 3 shows visual display 80 presenting a 
single genomic sequence annotated according to the present
invention. Because of its nominal resemblance to artistic
works of Piet Mondrian, visual display 80 is alternatively 
described herein as a "Mondrian". 

Each of the visual elements of display 80 is
aligned with respect to the genomic sequence being
annotated (hereinafter, the "annotated sequence"). Given
the number of nucleotides typically represented in an
annotated sequence, representation of individual
nucleotides would rarely be readable in hard copy output of
display 80. Typically, therefore, the annotated sequence
is schematized as rectangle 89, extending from the left
border of display 80 to its right border. By convention
herein, the left border of rectangle 89 represents the
first nucleotide of the sequence and the right border of
rectangle 89 represents the last nucleotide of the
sequence.

As further discussed below, however, the Mondrian
visual display of annotated sequence can serve as a
convenient graphical user interface for computerized
representation, analysis, and query of information stored
electronically. For such use, the individual nucleotides
can conveniently be linked to the X axis coordinate of
rectangle 89. This permits the annotated sequence at any
point within rectangle 89 readily to be viewed, either
automatically - for example, by time-delayed appearance of
a small overlaid window upon movement of a cursor or other
pointer over rectangle 89 - or through user intervention,
as by clicking a mouse or other pointing device at a point
in rectangle 89.
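The linkage between X axis coordinate and nucleotide position can
be sketched as a linear mapping. The sketch assumes a rectangle
described by a left edge and a width in display units, 1-based
nucleotide numbering, and clamping of pointer positions that fall
outside the rectangle; these conventions and the function names are
assumptions for illustration.

```python
def x_to_nucleotide(x, rect_left, rect_width, seq_length):
    """1-based nucleotide position under display coordinate x."""
    if rect_width <= 0 or seq_length <= 0:
        raise ValueError("rectangle width and sequence length must be positive")
    fraction = (x - rect_left) / rect_width
    fraction = min(max(fraction, 0.0), 1.0)  # clamp to the rectangle
    return min(int(fraction * seq_length), seq_length - 1) + 1


def nucleotide_to_x(position, rect_left, rect_width, seq_length):
    """Display coordinate of the left edge of a 1-based nucleotide position."""
    return rect_left + (position - 1) / seq_length * rect_width
```

The same two functions would serve to place the borders of any
rectangle that is aligned to the annotated sequence.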

Visual display 80 is generated after user
specification of the genomic sequence to be displayed.
Such specification can consist of or include an accession
number for a single clone (e.g., a single BAC accessioned
into GenBank), wherein the starting and stopping
nucleotides are thus absolutely identified, or
alternatively can consist of or include an anchor or
fulcrum point about which a chosen range of sequence is
anchored, thus providing relative endpoints for the
sequence to be displayed. For example, the user can anchor
such a range about a given chromosomal map location, gene
name, or even a sequence returned by query for similarity
or identity to an input query sequence. When visual
display 80 is used as a graphical user interface to
computerized data, additional control over the first and
last displayed nucleotide will typically be dynamically
selectable, as by use of standard zooming and/or selection
tools.

Field 81 of visual display 80 is used to present
the output from process 200, that is, to present the
bioinformatic prediction of those sequences having the
desired function within the genomic sequence. Functional
sequences are typically indicated by at least one rectangle
83 (83a, 83b, 83c), the left and right borders of which
respectively indicate, by their X-axis coordinates, the
starting and ending nucleotides of the region predicted to
have function.

Where a single bioinformatic method or approach
identifies a plurality of regions having the desired
function, a plurality of rectangles 83 is disposed
horizontally in field 81. Where multiple methods and/or
approaches are used to identify function, each such method
and/or approach can be represented by its own series of
horizontally disposed rectangles 83, each such horizontally
disposed series of rectangles offset vertically from those
representing the results of the other methods and
approaches.

Thus, rectangles 83a in FIG. 3 represent the
functional predictions of a first method of a first
approach for predicting function, rectangles 83b represent
the functional predictions of a second method and/or second
approach for predicting that function, and rectangles 83c
represent the predictions of a third method and/or
approach.

Where the function desired to be identified is
protein coding, field 81 is used to present the
bioinformatic prediction of sequences encoding protein.
For example, rectangles 83a can represent the results from
GRAIL or GRAIL II, rectangles 83b can represent the results
from GENEFINDER, and rectangles 83c can represent the
results from DICTION.

Optionally, and preferably, rectangles 83
collectively representing predictions of a single method
and/or approach are identically colored and/or textured,
and are distinguishable from the color and/or texture used
for a different method and/or approach.

Alternatively, or in addition, the color, hue,
density, or texture of rectangles 83 can be used further to
report a measure of the bioinformatic reliability of the
prediction. For example, many gene prediction programs
will report a measure of the reliability of prediction.
Thus, increasing degrees of such reliability can be
indicated, e.g., by increasing density of shading. Where
display 80 is used as a graphical user interface, such
measures of reliability, and indeed all other results
output by the program, can additionally or alternatively be
made accessible through linkage from individual rectangles
83, as by time-delayed window ("tool tip" window), or by
pointer (e.g., mouse)-activated link.

As earlier described, increased predictive
reliability can be achieved by requiring consensus among
methods and/or approaches to determining function. Thus,
field 81 can include a horizontal series of rectangles 83
that indicate one or more degrees of consensus in
predictions of function.
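One way such a consensus series could be derived is sketched below.
The sketch assumes each method reports non-overlapping regions as
inclusive, 1-based (start, end) pairs, and treats "consensus" as
coverage by at least a chosen number of methods; that definition,
the method labels, and the function name are assumptions for
illustration, not the consensus rule of the invention.

```python
def consensus_regions(predictions, min_methods=2):
    """Regions predicted by at least min_methods methods.

    predictions: dict mapping a method label to a list of inclusive
    (start, end) intervals, assumed non-overlapping within a method.
    """
    # Record +1 where an interval opens and -1 just past where it closes.
    events = {}
    for intervals in predictions.values():
        for start, end in intervals:
            events[start] = events.get(start, 0) + 1
            events[end + 1] = events.get(end + 1, 0) - 1

    regions, depth, region_start = [], 0, None
    for pos in sorted(events):
        depth += events[pos]
        if depth >= min_methods and region_start is None:
            region_start = pos
        elif depth < min_methods and region_start is not None:
            regions.append((region_start, pos - 1))
            region_start = None
    return regions
```

Calling the function with several values of min_methods yields one
series of regions per degree of consensus, each of which could be
drawn as its own row of rectangles.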

Although FIG. 3 shows three series of
horizontally disposed rectangles in field 81, display 80
can include as few as one such series of rectangles and as
many as can discriminably be displayed, depending upon the
number of methods and/or approaches used to predict a given
function.

Furthermore, field 81 can be used to show
predictions of a plurality of different functions.
However, the increased visual complexity occasioned by such
display makes more useful the ability of the user to select
a single function for display. When display 80 is used as
a graphical user interface for computer query and analysis,
such function can usefully be indicated and
user-selectable, as by a series of graphical buttons or tabs
(not shown in FIG. 3).

Rectangle 89 is shown in FIG. 3 as including
interposed rectangle 84. Rectangle 84 represents the
portion of annotated sequence for which predicted
functional information has been assayed physically, with
the starting and ending nucleotides of the assayed material
indicated by the X axis coordinates of the left and right
borders of rectangle 84. Rectangle 85, with optional
inclusive circles 86 (86a, 86b, and 86c) displays the
results of such physical assay.

Although a single rectangle 84 is shown in FIG.
3, physical assay is not limited to just one region of
annotated genomic sequence. It is expected that an
increasing percentage of regions predicted to have function
by process 200 will be assayed physically, and that display
80 will accordingly, for any given genomic sequence, have
an increasing number of rectangles 84 and 85, representing
an increased density of sequence annotation.

Where the function desired to be identified is
protein coding, rectangle 84 identifies the sequence of the
probe used to measure expression. In embodiments of the
present invention where expression is measured using
genome-derived single exon microarrays, rectangle 84
identifies the sequence included within the probe
immobilized on the support surface of the microarray. As
noted supra, such probe will often include a small amount
of additional, synthetic, material incorporated during
amplification and designed to permit reamplification of the
probe, which sequence is typically not shown in display 80.

Rectangle 87 is used to present the results of
bioinformatic assay of the genomic sequence. For example,
where the function desired to be identified is protein
coding, process 400 can include bioinformatic query of
expression databases with the sequences predicted in
process 200 to encode exons. And as earlier discussed,
because bioinformatic assay presents fewer constraints than
does physical assay, often the entire output of process 200
can be used for such assay, without further subsetting
thereof by process 300. Therefore, rectangle 87 typically
need not have separate indicators therein of regions
submitted for bioinformatic assay; that is, rectangle 87
typically need not have regions therein analogous to
rectangles 84 within rectangle 89.

Rectangle 87 as shown in FIG. 3 includes smaller
rectangles 880 and 88. Rectangles 880 indicate regions
that returned a positive result in the bioinformatic assay,
with rectangles 88 representing regions that did not return
such positive results. Where the function desired to be
predicted and displayed is protein coding, rectangles 880
indicate regions of the predicted exons that identify
sequence with significant similarity in expression
databases, such as EST, SNP, and SAGE databases, with
rectangles 88 indicating genes novel over those identified
in existing expression databases.

Rectangles 880 can further indicate, through
color, shading, texture, or the like, additional
information obtained from bioinformatic assay.

For example, where the function assayed and
displayed is protein coding, the degree of shading of
rectangles 880 can be used to represent the degree of
sequence similarity found upon query of expression
databases. The number of levels of discrimination can be
as few as two (identity, and similarity, where similarity
has a user-selectable lower threshold). Alternatively, as
many different levels of discrimination can be indicated as
can visually be discriminated.
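
The mapping from sequence similarity to a discrete shading
level can be illustrated with a minimal sketch. The function
name, the representation of similarity as a fraction between
0 and 1, and the default threshold of 0.8 are assumptions made
for illustration only; the specification does not prescribe
any particular implementation.

```python
def shading_level(similarity, threshold=0.8, levels=2):
    """Map a similarity fraction (0.0-1.0) to a discrete shading level.

    Returns None when similarity falls below the user-selectable lower
    threshold (no positive result, i.e. a rectangle 88), levels - 1 for
    identity, and 0 .. levels - 2 for graded similarity. All names and
    defaults here are hypothetical.
    """
    if levels < 2:
        raise ValueError("at least two levels (identity, similarity) are needed")
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie between 0 and 1")
    if similarity < threshold:
        return None
    if similarity == 1.0:
        return levels - 1
    # Spread the interval [threshold, 1.0) over the remaining levels.
    span = 1.0 - threshold
    bucket = int((similarity - threshold) / span * (levels - 1))
    return min(bucket, levels - 2)
```

With levels=2 the sketch reproduces the two-level case described
above (identity versus similarity above the threshold); a larger
value of levels yields progressively finer shading.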

Where display 80 is used as a graphical user
interface, rectangles 880 can additionally provide links
directly to the sequences identified by the query of
expression databases, and/or statistical summaries thereof.
As with each of the previously discussed uses of display
80 as a graphical user interface, it should be understood
that the information accessed via display 80 need not be
resident on the computer presenting such display, which
often will be serving as a client, with the linked
information resident on one or more remotely located
servers.

Rectangle 85 displays the results of physical
assay of the sequence delimited by its left and right
borders.

Rectangle 85 can consist of a single rectangle,
thus indicating a single assay, or alternatively, and
increasingly typically, will consist of a series of
rectangles (85a, 85b, 85c) indicating separate physical
assays of the same sequence.

Where the function assayed is gene expression,
and where gene expression is assayed as herein described
using simultaneous two-color fluorescent detection of
hybridization to genome-derived single exon microarrays,
individual rectangles 85 can be colored to indicate the
degree of expression relative to control. Conveniently,
shades of green can be used to depict expression in the
sample over control values, and shades of red used to
depict expression less than control, corresponding to the
spectra of the Cy3 and Cy5 dyes conventionally used for
respective labeling thereof. Additional functional
information can be provided in the form of circles 86 (86a,
86b, 86c), where the diameter of the circle can be used to
indicate expression intensity. As discussed infra, such
relative expression (expression ratios) and absolute
expression (signal intensity) can be expressed using
normalized values.

Where display 80 is used as a graphical user
interface, rectangle 85 can be used as a link to further
information about the assay. For example, where the assay
is one for gene expression, each rectangle 85 can be used
to link to information about the source of the hybridized
mRNA, the identity of the control, raw or processed data
from the microarray scan, or the like.
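
The color and circle-size conventions described above for
rectangles 85 and circles 86 can be sketched as follows. This
is an illustrative sketch only: the function names, the use of
a log2 ratio, the saturation point of 3 log2 units, and the
maximum diameter are all assumptions, not values taken from
the specification.

```python
import math


def expression_color(ratio, max_log2=3.0):
    """Return an (R, G, B) tuple for a sample/control expression ratio.

    Ratios above 1 give shades of green, ratios below 1 give shades of
    red, and a ratio of exactly 1 gives black. The log2 scaling and the
    saturation point max_log2 are hypothetical choices.
    """
    if ratio <= 0:
        raise ValueError("expression ratio must be positive")
    log_ratio = math.log2(ratio)
    strength = min(abs(log_ratio) / max_log2, 1.0)
    level = round(255 * strength)
    if log_ratio > 0:
        return (0, level, 0)   # expression over control: green
    if log_ratio < 0:
        return (level, 0, 0)   # expression under control: red
    return (0, 0, 0)


def circle_diameter(intensity, max_intensity, max_diameter=20.0):
    """Scale a (normalized) signal intensity to a circle 86 diameter."""
    fraction = min(max(intensity / max_intensity, 0.0), 1.0)
    return max_diameter * fraction
```

Under these assumptions an eightfold increase over control
saturates the green channel and an eightfold decrease saturates
the red channel; any monotonic mapping would serve equally well.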

FIG. 4 is a rendition of display 80 representing
gene prediction and gene expression for a hypothetical BAC,
showing conventions used in the Examples presented infra.
BAC sequence ("Chip seq.") 89 is presented, with the
physically assayed region thereof (corresponding to
rectangle 84 in FIG. 3) shown in white. Algorithmic gene
predictions are shown in field 81, with predictions by
GRAIL, predictions by GENEFINDER, and predictions by
DICTION each shown. Within rectangle 87, regions of sequence
that, when used to query expression databases, return
identical or similar sequences ("EST hit") are shown as
white rectangles (corresponding to rectangles 880 in FIG.
3), gray indicates low homology, and black indicates
unknowns (where black and gray would correspond to
rectangles 88 in FIG. 3).

Although FIGS. 3 and 4 show a single stretch of
sequence, uninterrupted from left to right, longer
sequences are usefully represented by vertical stacking of
such individual Mondrians, as shown in FIGS. 9 and 10.

Single Exon Probes Useful For Measuring Gene Expression 

The methods and apparatus of the present
invention rapidly produce functional information from
genomic sequence. Where the function to be identified is
protein coding, the methods and apparatus of the present
invention rapidly identify and confirm the expression of
portions of genomic sequence that function to encode
protein. As a direct result, the methods and apparatus of
the present invention rapidly yield large numbers of
single-exon nucleic acid probes, the majority from
previously unknown genes, each of which is useful for
measuring and/or surveying expression of a specific gene in
one or more tissues or cell types.

It is, therefore, another aspect of the present
invention to provide genome-derived single exon nucleic
acid probes useful for gene expression analysis, and
particularly for gene expression analysis by microarray.

Using the methods and genome-derived single-exon
microarrays of the present invention, we have for example
readily identified a large number of unique ORFs from human
genomic sequence. Using single exon probes that encompass
these ORFs, we have demonstrated, through microarray
hybridization analysis, the expression of 13,109 of these
ORFs in adult liver.

As would immediately be appreciated by one of
skill in the art, each single exon probe having
demonstrable expression in adult liver is currently
available for use in measuring the level of its ORF's
expression in adult liver.

Diseases of the liver are a significant cause of
human morbidity and mortality. Increasingly, genetic
factors are being found that contribute to predisposition,
onset, and/or aggressiveness of most, if not all, of these
diseases; although causative mutations in single genes have
been identified for some, these disorders are believed for
the most part to have polygenic etiologies.

For example, cirrhosis is a major public health
problem. In the industrialized world, it is among the top
ten causes of death; among patients aged 45 to 65, it is
the third leading cause of death. The high prevalence is
largely the result of alcohol abuse, but other major
contributors include chronic hepatitis, biliary disease and
iron overload. Approximately 10-15% of cases are cryptogenic.

Cirrhosis is a broad description encompassing the
common end stage of many forms of liver injury. Many
patients with cirrhosis will remain asymptomatic for years,
while others show generalized weakness, anorexia, malaise,
and weight loss or, occasionally, more severe symptoms.

The progression from fibrosis, an early
consequence of liver disease, to cirrhosis, and the
specific histologic morphology that characterizes cirrhosis
depend on the extent of injury, the presence of continuing
damage, and the response of the liver to damage. The liver
may be injured acutely and severely (e.g., necrosis with
hepatitis), moderately over months or years (e.g., biliary
tract obstruction and chronic active hepatitis), or
modestly but continuously (e.g., alcohol abuse).

During the repair process, new vessels
connecting the hepatic artery and portal vein to the
hepatic venules form within the fibrous sheath that
surrounds the surviving nodules of liver cells. These
vessels restore the intrahepatic circulatory pathway, but
provide relatively low-volume, high-pressure drainage that
is less efficient than normal and results in increased
portal vein pressure (portal hypertension). Thus,
cirrhosis is not static and its features depend on the
disease activity and stage.

As cirrhosis is the end stage of many forms of
liver disease, many genes have been identified that can
contribute to the development of cirrhosis. These include,
e.g., the genes responsible for Wilson disease (Online
Mendelian Inheritance in Man ("OMIM") 277900), type IV
glycogen storage disease (OMIM 232500), galactosemia (OMIM
230400), and a deficiency of alpha-1-antitrypsin (OMIM
107400). There is substantial evidence, however, for as
yet uncharacterized loci which cause cirrhosis.

For example, Iber and Maddrey, Prog. Liver Dis.
2:290-302 (1965), reviewed 13 previously reported families
and 8 new to this study, each with 2 or more affected
members. They pointed out that, with a single exception,
the multiple cases were in the same generation. Within a
given family, the age of onset, clinical course, and biopsy
findings were very similar, but there were wide differences
between families.

Kalra et al., Hum. Hered. 32:170-175 (1982)
studied the families of 220 cases of Indian childhood
cirrhosis and 70 families of age-matched controls. The
hypotheses of autosomal recessive, partial sex-linkage, and
doubly recessive inheritance were found untenable and the
authors concluded that multifactorial inheritance was most
plausible. Lefkowitch et al., New Eng. J. Med. 307:271-277
(1982) described 4 white American sibs who died between
ages 4.5 and 6 years of cirrhosis that closely resembled
that of the childhood cirrhosis of Asiatic Indians.

Other examples of uncharacterized loci which
cause cirrhosis are those related to the risk of
alcoholism.

Cloninger, Science 236:410-416 (1987), defined
two separate types of alcoholism. According to these
definitions, type 1 alcohol abuse has its usual onset after
the age of 25 years and is characterized by severe
psychological dependence and guilt. Type 1 occurs in both
men and women and requires both genetic and environmental
factors to become manifest. By contrast, type 2 alcohol
abuse has its onset before the age of 25; persons with this
type of alcoholism are characterized by their inability to
abstain from alcohol and by frequent aggressive and
antisocial behavior. Type 2 alcoholism is rarely found in
women and is much more heritable.

Despite considerable effort to identify genes
related to the risk of alcoholism, relatively few genes
have been identified. Some of this work has suggested a
relationship between the metabolism of dopamine and
alcoholism. Blum et al., J.A.M.A. 263:2055-2060 (1990) and
Bolos et al., J.A.M.A. 264:3156-3160 (1990) investigated
the relationship of the dopamine D2 receptor (DRD2; OMIM
126450) to alcoholism, but the sample size was small and
their results were inconclusive. However, Tiihonen et al.,
Molec. Psychiat. 4:286-289 (1999), found a markedly higher
frequency in a population of type 1 alcoholics of the low
activity allele of the enzyme catechol-O-methyltransferase
(COMT, OMIM 116790), which has a crucial role in the
metabolism of dopamine, suggesting a role for dopamine
metabolism in increased risk of alcoholism. For a brief
review of recent progress toward the identification of
genes related to risk for alcoholism see Buck, Genome
9:927-928 (1998).

As another example, multiple genes have been
shown to predispose to hyperlipoproteinemia or
hyperlipidemia. Much attention has been focused on these
disorders because there is a strong association of
hyperlipidemia, especially hypercholesterolemia, with
development of coronary artery disease. Coronary artery
disease accounts for at least 25% of all deaths in the
United States. Coronary artery disease results when the
arteries supplying the heart muscle become occluded by
plaques composed of lipids like cholesterol, blood clotting
components and blood cells.

The major plasma lipids circulate bound to
proteins as macromolecular complexes called lipoproteins.
Although closely interrelated, the major lipoprotein
classes (chylomicron, very-low-density lipoprotein (VLDL),
low-density lipoprotein (LDL), and high-density lipoprotein
(HDL)) are usually classified in terms of physico-chemical
properties (e.g., density after centrifugation).
Chylomicrons, the largest lipoproteins, carry exogenous
triglyceride from the intestine via the thoracic duct to
the venous system and into peripheral sites. VLDL carries
endogenous triglyceride primarily from the liver to the
same peripheral sites for storage or use. Lipases quickly
degrade the triglyceride in VLDL to produce intermediate
density lipoproteins (IDL) and within 2 to 6 h, IDL is
degraded further to generate LDL, which has a plasma
half-life of 2 to 3 days. While the overall fate of LDL is
unclear, the liver is responsible for removing
approximately 70% and active receptor sites have been found
on the surfaces of hepatocytes.

Several monogenic conditions that lead to
elevated levels of one or more serum lipoproteins have been
defined and the responsible gene identified, including,
e.g., hyperlipoproteinemia type I (OMIM 238600), familial
hypercholesterolemia (OMIM 143890), and familial defective
apolipoprotein B (OMIM 107730). However, in many cases the
etiology is unknown and there is strong evidence for
additional uncharacterized loci.

For example, Zuliani et al., Arterioscler.
Thromb. Vasc. Biol. 19:802-809 (1999) identified a
Sardinian family with a recessive form of
hypercholesterolemia with the clinical features of familial
hypercholesterolemia (OMIM 603813), and found that
previously identified genes were not responsible for this
disorder. They proposed that in this new lipid disorder, a
recessive defect causes a selective impairment of the LDL
receptor function in the liver. Ciccarese et al., Am. J.
Hum. Genet. 66:453-460 (2000) recently mapped this novel
disease locus.

Another example is designated familial combined
hyperlipidemia (OMIM 144250), which affects approximately
1-2% of the population in the Western world. This disorder
can have its basis in mutation in several novel genes, two
of which have been mapped to chromosome 1 (Pajukanta et
al., Nature Genet. 18:369-373 (1998)) and chromosome 11
(Aouizerat et al., Am. J. Hum. Genet. 65:397-412 (1999)).
The high frequency of this disorder suggests that most, if
not all, hyperlipidemias are of multifactorial genetic
etiology.

As yet a further example, primary sclerosing
cholangitis (PSC) is a disorder characterized by a patchy
obliterative inflammatory fibrosis of the large bile ducts.
Chronic inflammation leads to extensive bile duct
strictures, cholestasis, and gradual progression to biliary
cirrhosis. PSC occurs most often in young men and is
commonly associated with inflammatory bowel disease,
especially ulcerative colitis. The onset is usually
insidious, with gradual, progressive fatigue, pruritus, and
jaundice. There is no specific therapy for sclerosing
cholangitis, and liver transplantation is the only apparent
cure.

The etiology of PSC is not known, but both
genetic and immunologic abnormalities have been implicated.
However, the frequency of HLA-B8 and HLA-DT2, which are
associated with a number of autoimmune diseases, is higher
in PSC patients than in normal individuals. Prochazka et
al., New Eng. J. Med. 322:1842-1844 (1990) found that 100%
of 29 patients with primary sclerosing cholangitis carried
the HLA-DRw52a antigen, which is normally present in 35% of
the population.

As a still further example, sarcoidosis is a
disease of unknown cause characterized by non-caseating
granulomas in one or more organ systems. These granulomas
may resolve completely or proceed to fibrosis. The disorder
is systemic, but the liver is affected in approximately 75%
of cases. Sarcoidosis occurs mainly in persons aged 20 to
40 yr and is most common in Northern Europeans and American
blacks. The lifetime risk of developing sarcoidosis is
particularly high among Swedish men (1.15%), Swedish women
(1.6%), and African Americans (2.4%).

The much greater frequency in African Americans
relative to the United States population overall suggests a
genetic contribution to etiology. Early research studying
familial aggregation indicated that the disease may have a
nongenetic basis because the family pattern did not conform
to a simple Mendelian mode of inheritance (Allison, Sth.
Med. J. 57:27-32 (1964)). However, Headings et al., Ann.
N.Y. Acad. Sci. 278:377-385 (1976) favored multifactorial
genetic inheritance of susceptibility. Nowack et al.,
Arch. Intern. Med. 147:481-483 (1987), found an unusually
high frequency of HLA-DR5 in a study of 440 patients with
sarcoidosis in Marburg, Germany. They also concluded that
the role of an environmental or infectious agent triggering
sarcoidosis cannot be envisaged without considering
genetically linked cofactors.

Other significant diseases of liver are also
believed to have a genetic, typically polygenic, etiologic
component. These diseases include, e.g., primary biliary
cirrhosis, Zellweger syndrome, cholestasis-lymphedema
syndrome, Alstrom syndrome, primary pulmonary
hypertension, Berardinelli-Seip congenital lipodystrophy,
iron overload in Africa, neonatal cholestatic hepatitis,
autosomal recessive KID syndrome, familial
hypotransferrinemia, type I congenital dyserythropoietic
anemia, porphyria variegata, Finnish lactic acidosis with
hepatic hemosiderosis, Rotor syndrome, essential
hypertension, ARC syndrome, type II conjugated
hyperbilirubinemia, Lambert syndrome, ichthyosis congenita
with biliary atresia, Kabuki make-up syndrome, Meckel
syndrome, cerebral aneurysm-cirrhosis syndrome, glycogen
storage diseases, polycystic kidney and hepatic disease,
isolated Caroli disease, trisomy 18-like syndrome,
Osler-Rendu-Weber syndrome 3, fatal intrahepatic cholestasis,
Coach syndrome, type C Niemann-Pick disease, and
hepatocellular cancer.

Altered responses to a variety of infectious
agents that target the liver, especially acute viral
hepatitis, have also been shown or are suspected to have
genetic bases or contributions. In addition to
differential susceptibility to primary infectious agents,
these altered responses include predisposition to
complicating conditions following contact with particular
infectious agents. These include, e.g., development of
hepatocellular carcinoma 2 correlated with Hepatitis B
infection, and severe hepatic fibrosis following
Schistosoma mansoni infection.

The central role of the liver in drug metabolism
results in exposure of this organ to a large variety of
potentially toxic chemical agents and metabolites. These
include naturally occurring plant alkaloids and mycotoxins,
industrial chemicals, and, additionally, pharmacologic
agents used in treating disease. The range of
manifestations of toxin- and drug-induced liver disease is
virtually as broad as the range of acute and chronic
disorders, and these manifestations have also been shown or
are suspected to have genetic bases or contributions.

Such interactions between drugs and genotype have
been shown in the response, e.g., to the anticonvulsant
phenytoin, which can cause severe hepatitis-like disease in
individuals who are impaired in the ability to detoxify a
metabolite of phenytoin in the liver, and in the response
to the drug sodium valproate, which can produce severe
hepatotoxicity in certain individuals. The abnormal
responses to both of these drugs are believed to be
influenced by underlying genetic factors.

The human genome-derived single exon nucleic acid
probes and microarrays of the present invention are useful
for predicting, diagnosing, grading, staging, monitoring
and prognosing diseases of human liver, particularly those
diseases with polygenic etiology. With each of the single
exon probes described herein shown to be expressed at
detectable levels in human liver, and with about 2/3 of the
probes identifying novel genes, the single exon microarrays
of the present invention provide exceptionally high
informational content for such studies.

For example, diagnosis (including differential
diagnosis among clinically indistinguishable disorders,
such as cirrhosis), staging, and/or grading of a disease
can be based upon the quantitative relatedness of a patient
gene expression profile to one or more reference expression
profiles known to be characteristic of a given liver
disease, or to specific grades or stages thereof.

In one embodiment, the patient gene expression
profile is generated by hybridizing nucleic acids obtained
directly or indirectly from transcripts expressed in the
patient's liver to the genome-derived single exon
microarray of the present invention. Reference profiles
are obtained similarly, using nucleic acids obtained
directly or indirectly from transcripts expressed by liver
of individuals with known liver disease. Methods for
quantitatively relating gene expression profiles, without
regard to the function of the protein encoded by the gene,
are disclosed in WO 99/58720, incorporated herein by
reference in its entirety.
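
The notion of quantitatively relating a patient profile to
reference profiles can be illustrated with a minimal sketch.
Pearson correlation is used here purely as a stand-in measure
of relatedness; it is an assumption made for illustration and
is not asserted to be the method of WO 99/58720. The function
names and the example profile labels are likewise hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


def closest_reference(patient, references):
    """Return the label of the reference profile most related to patient.

    `references` maps a label (e.g. a disease stage) to a profile measured
    over the same probes, in the same order, as `patient`.
    """
    return max(references, key=lambda label: pearson(patient, references[label]))
```

In use, each profile would be the vector of normalized
expression values reported by the probes of the microarray;
the sketch makes no use of the function of any encoded protein.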

In another approach, the genome-derived single
exon probes and microarrays of the present invention can be
used to interrogate genomic DNA, rather than pools of
expressed message; this latter approach permits
predisposition to and/or prognosis of liver disease to be
assessed through the massively parallel determination of
altered copy number, deletion, or mutation in the patient's
genome of exons known to be expressed in human liver. The
algorithms set forth in WO 99/58720 can be applied to such
genomic profiles without regard to the function of the
protein encoded by the interrogated gene.

The utility is specific to the probe; at
sufficiently high hybridization stringency, which
stringencies are well known in the art (see Ausubel et al.
and Maniatis et al.), each probe reports the level of
expression of message specifically containing that ORF.

It should be appreciated, however, that the
probes of the present invention, for which expression in
the adult liver has been demonstrated, are useful both for
measurement in the adult liver and for survey of expression
in other tissues.

Significant among such advantages is the presence
of probes for novel genes.

As mentioned above and further detailed in
Examples 1 and 2, the methods described enable ORFs which
are not present in existing expression databases to be
identified. And the fewer the number of tissues in which
the ORF can be shown to be expressed, the more likely the
ORF will prove to be part of a novel gene: as further
discussed in Example 2, ORFs whose expression was
measurable in only a single one of the tested tissues were
represented in existing expression databases at a rate of
only 11%, whereas 36% of ORFs whose expression was
measurable in 9 tissues were present in existing expression
databases, and fully 45% of those ORFs expressed in all ten
tested tissues were present in existing expressed sequence
databases.

Either as tools for measuring gene expression or
tools for surveying gene expression, the genome-derived
single exon probes of the present invention have
significant advantages over the cDNA or EST-based probes
that are currently available for achieving these utilities.

The genome-derived single exon probes of the
present invention are useful in constructing genome-derived
single exon microarrays; the genome-derived single exon
microarrays, in turn, are useful devices for measuring and
for surveying gene expression in the human.

Gene expression analysis using microarrays,
conventionally using microarrays having probes derived from
expressed message, is well-established as useful in the
biological research arts (see Lockhart et al., Nature
405:827-836).

Microarrays have been used to determine gene
expression profiles in cells in response to drug treatment
(see, for example, Kaminski et al., "Global Analysis of
Gene Expression in Pulmonary Fibrosis Reveals Distinct
Programs Regulating Lung Inflammation and Fibrosis," Proc.
Natl. Acad. Sci. USA 97(4):1778-83 (2000); Bartosiewicz et
al., "Development of a Toxicological Gene Array and
Quantitative Assessment of This Technology," Arch. Biochem.
Biophys. 376(1):66-73 (2000)), viral infection (see, for
example, Geiss et al., "Large-scale Monitoring of Host Cell
Gene Expression During HIV-1 Infection Using cDNA
Microarrays," Virology 266(1):8-16 (2000)) and during cell
processes such as differentiation, senescence and apoptosis
(see, for example, Shelton et al., "Microarray Analysis of
Replicative Senescence," Curr. Biol. 9(17):939-45 (1999);
Voehringer et al., "Gene Microarray Identification of Redox
and Mitochondrial Elements That Control Resistance or
Sensitivity to Apoptosis," Proc. Natl. Acad. Sci. USA
97(6):2680-5 (2000)).

Microarrays have also been used to determine
abnormal gene expression in diseased tissues (see, for
example, Alon et al., "Broad Patterns of Gene Expression
Revealed by Clustering Analysis of Tumor and Normal Colon
Tissues Probed by Oligonucleotide Arrays," Proc. Natl.
Acad. Sci. USA 96(12):6745-50 (1999); Perou et al.,
"Distinctive Gene Expression Patterns in Human Mammary
Epithelial Cells and Breast Cancers," Proc. Natl. Acad. Sci.
USA 96(16):9212-7 (1999); Wang et al., "Identification of
Genes Differentially Over-expressed in Lung Squamous Cell
Carcinoma Using Combination of cDNA Subtraction and
Microarray Analysis," Oncogene 19(12):1519-28 (2000);
Whitney et al., "Analysis of Gene Expression in Multiple
Sclerosis Lesions Using cDNA Microarrays," Ann. Neurol.
46(3):425-8 (1999)), in drug discovery screens (see, for
example, Scherf et al., "A Gene Expression Database for the
Molecular Pharmacology of Cancer," Nat. Genet. 24(3):236-44
(2000)) and in diagnosis to determine appropriate treatment
strategies (see, for example, Sgroi et al., "In vivo Gene
Expression Profile Analysis of Human Breast Cancer
Progression," Cancer Res. 59(22):5656-61 (1999)).

In microarray-based gene expression screens of
pharmacological drug candidates upon cells, each probe
provides specific useful data. In particular, it should be
appreciated that even those probes that show no change in
expression are as informative as those that do change,
serving, in essence, as negative controls.

For example, where gene expression analysis is
used to assess toxicity of chemical agents on cells, the
failure of the agent to change a gene's expression level is
evidence that the agent likely does not affect the pathway
of which the gene's expressed protein is a part.
Analogously, where gene expression analysis is used to
assess side effects of pharmacological agents (whether in
lead compound discovery or in subsequent screening of lead
compound derivatives), the inability of the agent to alter
a gene's expression level is evidence that the drug does
not affect the pathway of which the gene's expressed
protein is a part.

WO 99/58720 provides methods for quantifying the
relatedness of a first and second gene expression profile
and for ordering the relatedness of a plurality of gene
expression profiles. The methods so described permit
useful information to be extracted from a greater
percentage of the individual gene expression measurements
from a microarray than methods previously used in the art.

Other uses of microarrays are described in
Gerhold et al., Trends Biochem. Sci. 24(5):168-173 (1999);
Zweiger, Trends Biotechnol. 17(11):429-436 (1999); and
Schena et al.

The invention particularly provides
genome-derived single-exon probes known to be expressed in
adult liver. The individual single exon probes can be
provided in the form of substantially isolated and purified
nucleic acid, typically, but not necessarily, in a quantity
sufficient to perform a hybridization reaction.

Such nucleic acid can be in any form directly
hybridizable to the message that contains the probe's ORF,
such as double-stranded DNA, single-stranded DNA
complementary to the message, single-stranded RNA
complementary to the message, or chimeric DNA/RNA molecules
so hybridizable. The nucleic acid can alternatively or
additionally include either nonnative nucleotides,
alternative internucleotide linkages, or both, so long as
complementary binding can be obtained. For example, probes
can include phosphorothioates, methylphosphonates,
morpholino analogs, and peptide nucleic acids (PNA), as are
described, for example, in U.S. Patent Nos. 5,142,047;
5,235,033; 5,166,315; 5,217,866; 5,184,444; 5,861,250.

Usefully, however, such probes are provided in a
form and quantity suitable for amplification, where the
amplified product is thereafter to be used in the
hybridization reactions that probe gene expression.
Typically, such probes are provided in a form and quantity
suitable for amplification by PCR or by other well known
amplification techniques. One such technique additional to
PCR is rolling circle amplification, as is described, inter
alia, in U.S. Patent Nos. 5,854,033 and 5,714,320 and
international patent publications WO 97/19193 and
WO 00/15779. As is well understood, where the probes are
to be provided in a form suitable for amplification, the
range of nucleic acid analogues and/or internucleotide
linkages will be constrained by the requirements and nature
of the amplification enzyme.

Where the probe is to be provided in a form
suitable for amplification, the quantity need not be
sufficient for direct hybridization for gene expression
analysis, and need be sufficient only to function as an
amplification template, typically at least about 1, 10 or
100 pg or more.

Each discrete amplifiable probe can also be
packaged with amplification primers, either in a single
composition that comprises probe template and primers, or
in a kit that comprises such primers separately packaged
therefrom. As earlier mentioned, the ORF-specific
5' primers used for genomic amplification can have a first
common sequence added thereto, and the ORF-specific 3'
primers used for genomic amplification can have a second,
different, common sequence added thereto, thus permitting,
in this embodiment, the use of a single set of 5' and 3'
primers to amplify any one of the probes. The probe
composition and/or kit can also include buffers, enzyme,
etc., required to effect amplification.

As mentioned earlier, when intended for use on a
genome-derived single exon microarray of the present
invention, the genome-derived single exon probes of the
present invention will typically average at least about
100, 200, 300, 400 or 500 bp in length, including (and
typically, but not necessarily, centered about) the ORF.
Furthermore, when intended for use on a genome-derived
single exon microarray of the present invention, the
genome-derived single exon probes of the present invention
will typically not contain a detectable label.

When intended for use in solution phase
hybridization, however (that is, for use in a
hybridization reaction in which the probe is not first
bound to a support substrate, although the target may
indeed be so bound), length constraints that are imposed
in microarray-based hybridization approaches will be
relaxed, and such probes will typically be labeled.

In such case, the only functional constraint that
dictates the minimum size of such probe is that each such
probe must be capable of specifically identifying in a
hybridization reaction the exon from which it is drawn. In
theory, a probe of as little as 17 nucleotides is capable
of uniquely identifying its cognate sequence in the human
genome. For hybridization to expressed message, a subset
of target sequence that is much reduced in complexity as
compared to genomic sequence, even fewer nucleotides are
required for specificity.
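
The 17-nucleotide figure can be checked with a simple counting
argument, sketched below for illustration only. In an idealized
random sequence, a k-mer is expected to be unique once 4^k exceeds
the number of positions at which it could occur. The genome size
of 3.2 x 10^9 bp, the counting of both strands, and the figure of
3 x 10^7 nucleotides for the complexity of expressed message are
assumptions made for this sketch and are not taken from this
description; repeated sequence in real genomes is ignored.

```python
def min_unique_probe_length(target_bases: int, strands: int = 2) -> int:
    """Smallest k such that 4**k exceeds the number of candidate sites.

    Idealized model: the target is treated as random sequence, so a
    k-mer is expected to occur at most once when 4**k > number of sites.
    """
    sites = target_bases * strands
    k = 1
    while 4 ** k <= sites:
        k += 1
    return k


# Assumed haploid genome size (~3.2e9 bp), both strands counted.
genome_k = min_unique_probe_length(3_200_000_000)

# Assumed complexity of expressed message (~3e7 nt), one strand only.
message_k = min_unique_probe_length(30_000_000, strands=1)

print(genome_k, message_k)
```

Under these assumptions the genomic estimate comes out at 17
nucleotides and the estimate for expressed message is shorter,
consistent with the reasoning above.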

Therefore, the probes of the present invention
can include as few as 20, 25 or 50 bp of ORF, or more. In
particular embodiments, the ORF sequences are given in SEQ
ID NOS. 13,110 - 25,995, respectively, for probe SEQ ID
NOS. 1 - 13,109. The minimum amount of ORF required to be
included in the probe of the present invention in order to
provide specific signal in either solution phase or
microarray-based hybridizations can readily be determined
for each of ORF SEQ ID NOS. 13,110 - 25,995 individually by
routine experimentation using standard high stringency
conditions.

Such high stringency conditions are described,
inter alia, in Ausubel et al. and Maniatis et al. For
microarray-based hybridization, standard high stringency
conditions can usefully be 50% formamide, 5X SSC, 0.2 ug/ul
poly(dA), 0.2 ug/ul human Cot-1 DNA, and 0.5% SDS, in a
humid oven at 42°C overnight, followed by successive washes
of the microarray in 1X SSC, 0.2% SDS at 55°C for 5
minutes, and then 0.1X SSC, 0.2% SDS, at 55°C for 20
minutes. For solution phase hybridization, standard high
stringency conditions can usefully be aqueous hybridization
at 65°C in 6X SSC. Lower stringency conditions, suitable
for cross-hybridization to mRNA encoding structurally- and
functionally-related proteins, can usefully be the same as
the high stringency conditions but with reduction in
temperature for hybridization and washing to room
temperature (approximately 25°C).

When intended for use in solution phase
hybridization, the maximum size of the single exon probes
of the present invention is dictated by the proximity of
other expressed exons in genomic DNA: although each single
exon probe can include intergenic and/or intronic material
contiguous to the ORF in the human genome, each probe of
the present invention will include portions of only one
expressed exon.

Thus, each single exon probe will include no more
than about 25 kb of contiguous genomic sequence, more
typically no more than about 20 kb of contiguous genomic
sequence, more usually no more than about 15 kb, even more
usually no more than about 10 kb. Usually, probes that are
maximally about 5 kb will be used, more typically no more
than about 3 kb.

It will be appreciated that the Sequence Listing
appended hereto presents, by convention, only that strand
of the probe and ORF sequence that can be directly
translated reading from 5' to 3' end. As would be well
understood by one of skill in the art, single stranded
probes must be complementary in sequence to the ORF as
present in an mRNA; it is well within the skill in the art
to determine such complementary sequence. It will further
be understood that double stranded probes can be used in
both solution-phase hybridization and microarray-based
hybridization if suitably denatured.
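
Determining the complementary strand is a mechanical operation.
A minimal sketch follows, assuming the listed strand is supplied
as a DNA string over the letters A, C, G and T; the function names
are illustrative only and do not appear elsewhere in this
description.

```python
# Translation table for Watson-Crick complements (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def single_stranded_probe(listed_strand: str) -> str:
    """DNA probe complementary to a message carrying the listed strand.

    The listed strand is the directly translatable (sense) strand, so a
    single-stranded probe that hybridizes to the mRNA is its reverse
    complement.
    """
    return reverse_complement(listed_strand)


print(single_stranded_probe("ATGGCC"))
```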

Thus, it is an aspect of the present invention to
provide single-stranded nucleic acid probes that have
sequence complementary to those described herein above and
below, and double-stranded probes one strand of which has
sequence complementary to the probes described herein.

The probes can, but need not, contain intergenic
and/or intronic material that flanks the ORF, on one or
both sides, in the same linear relationship to the ORF that
the intergenic and/or intronic material bears to the ORF in
genomic DNA. The probes do not, however, contain nucleic
acid derived from more than one expressed ORF.

When intended for use in solution
hybridization, the probes of the present invention can
usefully have detectable labels. Nucleic acid labels are
well known in the art, and include, inter alia, radioactive
labels, such as ³H, ³²P, ³³P, ³⁵S, ¹²⁵I, ¹³¹I; fluorescent
labels, such as Cy3, Cy5, Cy5.5, Cy7, SYBR®
Green, and other labels described in Haugland,
Handbook of Fluorescent Probes and Research Chemicals, 7th
ed., Molecular Probes Inc., Eugene, OR (2000), or
fluorescence resonance energy transfer tandem conjugates
thereof; labels suitable for chemiluminescent and/or
enhanced chemiluminescent detection; labels suitable for
ESR and NMR detection; and labels that include one member
of a specific binding pair, such as biotin, digoxigenin, or
the like.

The probes, either in quantity sufficient for
hybridization or sufficient for amplification, can be
provided in individual vials or containers.

Alternatively, such probes can usefully be
packaged as a plurality of such individual genome-derived
single exon probes.

When provided as a collection of plural
individual probes, the probes are typically made available
in amplifiable form in a spatially-addressable ordered set,
typically one per well of a microtiter dish. Although a 96
well microtiter plate can be used, greater efficiency is
obtained using higher density arrays.

If, as earlier mentioned, the ORF-specific
5' primers used for genomic amplification had a first
common sequence added thereto, and the ORF-specific 3'
primers used for genomic amplification had a second,
different, common sequence added thereto, a single set of
5' and 3' primers can be used to amplify all of the probes
from the amplifiable ordered set.

Such collections of genome-derived single exon
probes can usefully include a plurality of probes chosen
for the common attribute of expression in the human adult
liver.

In such defined subsets, typically at least 50,
60, 75, 80, 85, 90 or 95% or more of the probes will be
chosen by their expression in the defined tissue or cell
type.

The single exon probes of the present invention,
as well as fragments of the single exon probes comprising
selectively hybridizable portions of the probe ORF, can be
used to obtain the full length cDNA that includes the ORF
by (i) screening of cDNA libraries; (ii) rapid
amplification of cDNA ends ("RACE"); or (iii) other
conventional means, as are described, inter alia, in
Ausubel et al. and Maniatis et al.

It is another aspect of the present invention to
provide genome-derived single exon nucleic acid microarrays
useful for gene expression analysis, where the term
"microarray" has the meaning given in the definitional
section of this description, supra.

The invention particularly provides genome-derived
single-exon nucleic acid microarrays comprising a
plurality of probes known to be expressed in human adult
liver. In preferred embodiments, the present invention
provides human genome-derived single exon microarrays
comprising a plurality of probes drawn from the group
consisting of SEQ ID NOS.: 1 - 13,109.

When used for gene expression analysis, the
genome-derived single exon microarrays provide greater
physical informational density than do the genome-derived
single exon microarrays that have lower percentages of
probes known to be expressed commonly in the tested tissue.
At a fixed probe density, for example, a given microarray
surface area of the defined subset genome-derived single
exon microarray can yield a greater number of expression
measurements. Alternatively, at a given probe density, the
same number of expression measurements can be obtained from
a smaller substrate surface area. Alternatively, at a
fixed probe density and fixed surface area, probes can be
provided redundantly, providing greater reliability in
signal measurement for any given probe. Furthermore, with
a higher percentage of probes known to be expressed in the
assayed tissue, the dynamic range of the detection means
can be adjusted to reveal finer levels of discrimination
among the levels of expression.

Although particularly described with respect to
their utility as probes of gene expression, particularly as
probes to be included on a genome-derived single exon
microarray, each of the nucleic acids having SEQ ID NOS.: 1
- 13,109 contains an open-reading frame, set forth
respectively in SEQ ID NOS.: 13,110 - 25,995, that encodes
a protein domain. Thus, each of SEQ ID NOS. 1 - 13,109 can
be used, or that portion thereof in SEQ ID NOS. 13,110 -
25,995 used, to express a protein domain by standard in
vitro recombinant techniques. See Ausubel et al. and
Maniatis et al.

Additionally, kits are available commercially
that readily permit such nucleic acids to be expressed as
protein in bacterial cells, insect cells, or mammalian
cells, as desired (e.g., HAT™ Protein Expression &
Purification System, ClonTech Laboratories, Palo Alto, CA;
Adeno-X™ Expression System, ClonTech Laboratories, Palo
Alto, CA; Protein Fusion & Purification (pMAL™) System, New
England Biolabs, Beverly, MA).

Furthermore, shorter peptides can be chemically
synthesized using commercial peptide synthesizing equipment
and well known techniques. Procedures are described, inter
alia, in Chan et al. (eds.), Fmoc Solid Phase Peptide
Synthesis: A Practical Approach (Practical Approach Series,
(Paper)), Oxford Univ. Press (March 2000) (ISBN:
0199637245); Jones, Amino Acid and Peptide Synthesis
(Oxford Chemistry Primers, No 7), Oxford Univ. Press
(August 1992) (ISBN: 0198556683); and Bodanszky, Principles
of Peptide Synthesis (Springer Laboratory), Springer Verlag
(December 1993) (ISBN: 0387564314).

It is, therefore, another aspect of the invention
to provide peptides comprising an amino acid sequence
translated from SEQ ID NOS.: 13,110 - 25,995. Such amino
acid sequences are set out in SEQ ID NOS.: 25,996 - 38,578.
Any such recombinantly-expressed or synthesized peptide of
at least 8, and preferably at least about 15, amino acids,
can be conjugated to a carrier protein and used to generate
antibody that recognizes the peptide. Thus, it is a
further aspect of the invention to provide peptides that
have at least 8, preferably at least 15, consecutive amino
acids.


The following examples are offered by way of 
illustration and not by way of limitation. 



EXAMPLE 1

Preparation of Single Exon Microarrays from ORFs Predicted
in Human Genomic Sequence

Bioinformatics Results

All human BAC sequences in fewer than 10 pieces
that had been accessioned in a five month period
immediately preceding this study were downloaded from
GenBank. This corresponds to ~2200 clones, totaling ~350
MB of sequence, or approximately 10% of the human genome.

After masking repetitive elements using the
program CROSS_MATCH, the sequence was analyzed for open
reading frames using three separate gene finding programs.
The three programs predict genes using independent
algorithmic methods developed on independent training sets:
GRAIL uses a neural network, GENEFINDER uses a hidden
Markov model, and DICTION, a program proprietary to
Genetics Institute, operates according to a different
heuristic. The results of all three programs were used to
create a prediction matrix across the segment of genomic
DNA.

The three gene finding programs yielded a range
of results. GRAIL identified the greatest percentage of
genomic sequence as putative coding region, 2% of the data
analyzed. GENEFINDER was second, calling 1%, and DICTION
yielded the least putative coding region, with 0.8% of
genomic sequence called as coding region.

The consensus data were as follows. GRAIL and
GENEFINDER agreed on 0.7% of genomic sequence, GRAIL and
DICTION agreed on 0.5% of genomic sequence, and the three
programs together agreed on 0.25% of the data analyzed.
That is, 0.25% of the genomic sequence was identified by
all three of the programs as containing putative coding
region.
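
The notion of agreement among programs can be illustrated with a
small voting sketch. It assumes, purely for illustration, that
each program's output has been reduced to a set of base positions
called as coding; the toy coordinates below are hypothetical and
are not data from this study, and the actual form of the
prediction matrix is not specified here.

```python
from collections import Counter
from typing import Dict, Set


def consensus_positions(predictions: Dict[str, Set[int]],
                        min_votes: int = 2) -> Set[int]:
    """Positions called as coding by at least `min_votes` programs."""
    votes = Counter()
    for positions in predictions.values():
        votes.update(positions)
    return {pos for pos, n in votes.items() if n >= min_votes}


# Hypothetical calls on a 30-base toy segment.
toy_calls = {
    "GRAIL": set(range(0, 20)),
    "GENEFINDER": set(range(10, 25)),
    "DICTION": set(range(15, 30)),
}

two_of_three = consensus_positions(toy_calls, min_votes=2)
all_three = consensus_positions(toy_calls, min_votes=3)
print(len(two_of_three), len(all_three))
```

Dividing the size of each set by the length of the analyzed
sequence gives percentages of the kind reported above.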

ORFs predicted by any two of the three programs
("consensus ORFs") were assorted into "gene bins" using two
criteria: (1) any 7 consecutive exons within a 25 kb window
were placed together in a bin as likely contributing to a
single gene, and (2) all ORFs within a 25 kb window were
placed together in a bin as likely contributing to a single
gene if fewer than 7 exons were found within the 25 kb
window.
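
One way to read the two binning criteria is as a greedy pass over
consensus ORFs sorted by position, closing a bin when it holds 7
ORFs or when the next ORF would extend the bin beyond 25 kb. The
sketch below implements that reading only; the exact procedure
used in the study is not given, and the tuple representation and
function names are assumptions made for illustration.

```python
from typing import List, Tuple

ORF = Tuple[int, int]  # (start, end) coordinates on one genomic clone


def assign_gene_bins(orfs: List[ORF], window: int = 25_000,
                     max_exons: int = 7) -> List[List[ORF]]:
    """Greedily group position-sorted ORFs into putative gene bins."""
    bins: List[List[ORF]] = []
    for orf in sorted(orfs):
        current = bins[-1] if bins else None
        fits = (current is not None
                and len(current) < max_exons
                and orf[1] - current[0][0] <= window)
        if fits:
            current.append(orf)
        else:
            bins.append([orf])  # start a new bin with this ORF
    return bins


def largest_orf(gene_bin: List[ORF]) -> ORF:
    """Longest ORF in a bin, the candidate chosen for amplification."""
    return max(gene_bin, key=lambda o: o[1] - o[0])


# Hypothetical coordinates, for illustration only.
example = [(100, 300), (5000, 5400), (24000, 24100), (40000, 40900)]
example_bins = assign_gene_bins(example)
print([largest_orf(b) for b in example_bins])
```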

PCR 

The largest ORF from each gene bin that did not
span repetitive sequence was then chosen for amplification,
as were all consensus ORFs longer than 500 bp. This method
approximated one exon per gene; however, a number of genes
were found to be represented by multiple elements.

Previously, we had determined that DNA fragments
fewer than 250 bp in length do not bind well to the
amino-modified glass surface of the slides used as support
substrate for construction of microarrays; therefore,
amplicons were designed in the present experiments to
approximate 500 bp in length.

Accordingly, after selecting the largest ORF per
gene bin, a 500 bp fragment of sequence centered on the ORF
was passed to the primer picking software, PRIMER3
(available online for use at
http://www-genome.wi.mit.edu/cgi-bin/primer/). A first
additional sequence was commonly added to each ORF-unique
5' primer, and a second, different, additional sequence was
commonly added to each ORF-unique 3' primer, to permit
subsequent reamplification of the amplicon using a single
set of "universal" 5' and 3' primers, thus immortalizing
the amplicon. The addition of universal priming sequences
also facilitates sequence verification, and can be used to
add a cloning site should some ORFs be found to warrant
further study.
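
The primer tailing described above amounts to prepending a common
sequence to the 5' end of each ORF-specific primer. A minimal
sketch follows. The two tail sequences and the ORF-specific
primer sequences shown are placeholders chosen for illustration;
the common sequences actually used are not given in this
description.

```python
from typing import Tuple

# Placeholder tails (assumed for illustration, not from this study).
UNIVERSAL_5P_TAIL = "GTAAAACGACGGCCAGT"
UNIVERSAL_3P_TAIL = "CAGGAAACAGCTATGAC"


def add_universal_tails(forward_specific: str,
                        reverse_specific: str) -> Tuple[str, str]:
    """Prepend the common tails to an ORF-specific primer pair.

    Both primers are written 5' to 3', so each tail sits at the
    5' end and is copied into the amplicon during genomic PCR.
    """
    return (UNIVERSAL_5P_TAIL + forward_specific,
            UNIVERSAL_3P_TAIL + reverse_specific)


# Hypothetical ORF-specific primers.
fwd, rev = add_universal_tails("ACGTACGTACGTACGTAC", "TTGGCCAATTGGCCAATT")
print(fwd)
print(rev)
```

Because every amplicon then carries the same two terminal
sequences, one primer pair matching the tails can reamplify or
sequence any member of the set.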

The ORFs were then PCR amplified from genomic
DNA, verified on agarose gels, and sequenced using the
universal primers to validate the identity of the amplicon
to be spotted in the microarray.

Primers were supplied by Operon Technologies
(Alameda, CA). PCR amplification was performed by standard
techniques using human genomic DNA (Clontech, Palo Alto,
CA) as template. Each PCR product was verified by SYBR®
green (Molecular Probes, Inc., Eugene, OR) staining of
agarose gels, with subsequent imaging by Fluorimager
(Molecular Dynamics, Inc., Sunnyvale, CA). PCR
amplification was classified as successful if a single band
appeared.

The success rate for amplifying ORFs of interest
directly from genomic DNA using PCR was approximately 75%.
FIG. 5 graphs the distribution of predicted ORF (exon)
length and distribution of amplified PCR products, with ORF
length shown in red and PCR product length shown in blue
(which may appear black in the figure). Although the range
of ORF sizes is readily seen to extend to beyond 900 bp,
the mean predicted exon size was only 229 bp, with a median
size of 150 bp (n=9498). With an average amplicon size of
475 ± 25 bp, approximately 50% of the average PCR
amplification product contained predicted coding region,
with the remaining 50% of the amplicon containing either
intron, intergenic sequence, or both.
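
The relationship between exon length and amplicon content can be
sketched as follows. The coordinate convention (0-based,
half-open) and the function names are assumptions made for
illustration; in practice the final amplicon boundaries depend on
where acceptable primers are found.

```python
from typing import Tuple


def amplicon_window(orf_start: int, orf_end: int,
                    amplicon_len: int = 500) -> Tuple[int, int]:
    """Window of `amplicon_len` bases centered on the ORF midpoint.

    The window is clamped at the start of the clone; clamping at the
    clone end is omitted for brevity.
    """
    center = (orf_start + orf_end) // 2
    start = max(0, center - amplicon_len // 2)
    return start, start + amplicon_len


def coding_fraction(orf_len: int, amplicon_len: int) -> float:
    """Fraction of the amplicon occupied by predicted coding region."""
    return min(orf_len, amplicon_len) / amplicon_len


# Mean exon (229 bp) within the average amplicon (475 bp).
print(round(coding_fraction(229, 475), 2))
```

With the mean exon and average amplicon sizes reported above, the
coding fraction is a little under one half, in line with the
approximately 50% stated.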

Using a strategy predicated on amplifying about
500 bp, it was found that long exons had a higher PCR
failure rate. To address this, the bioinformatics process
was adjusted to amplify 1000, 1500 or 2000 bp fragments
from exons larger than 500 bp. This improved the rate of
successful amplification of exons exceeding 500 bp,
constituting about 9.2% of the exons predicted by the gene
finding algorithms.

Approximately 75% of the probes disposed on the
array (90% of those that successfully PCR amplified) were
sequence-verified by sequencing in both the forward and
reverse direction using MegaBACE sequencer (Molecular
Dynamics, Inc., Sunnyvale, CA), universal primers, and
standard protocols.

Some genomic clones (BACs) yielded very poor PCR
and sequencing results. The reasons for this are unclear,
but may be related to the quality of early draft sequence
or the inclusion of vector and host contamination in some
submitted sequence data.

Although the intronic and intergenic material
flanking coding regions could theoretically interfere with
hybridization during microarray experiments, subsequent
empirical results demonstrated that differential expression
ratios were not significantly affected by the presence of
noncoding sequence. The variation in exon size was
similarly found not to affect differential expression
ratios significantly; however, variation in exon size was
observed to affect the absolute signal intensity (data not
shown).

The 350 MB of genomic DNA was, by the above-described
process, reduced to 9750 discrete probes, which
were spotted in duplicate onto glass slides using
commercially available instrumentation (MicroArray GenII
Spotter and/or MicroArray GenIII Spotter, Molecular
Dynamics, Inc., Sunnyvale, CA). Each slide additionally
included either 16 or 32 E. coli genes, the average
hybridization signal of which was used as a measure of
background biological noise.

Each of the probe sequences was BLASTed against 
the human EST data set, the NR data set, and SwissProt 
GenBank (May 7, 1999 release 2.0.9). 

One third of the probe sequences (as amplified)
produced an exact match (BLAST Expect ("E") values less
than 1e-100) to either an EST (20% of sequences) or a known
mRNA (13% of sequences). A further 22% of the probe
sequences showed some homology to a known EST or mRNA
(BLAST E values from 1e-5 to 1e-99). The remaining 45% of
the probe sequences showed no significant sequence homology
to any expressed, or potentially expressed, sequences
present in public databases.
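
The three homology classes can be expressed as a simple
threshold rule on the best BLAST E value for each probe. The
sketch below takes its two thresholds from the text above; the
handling of values falling exactly on a threshold, the class
labels, and the toy E values are assumptions made for
illustration.

```python
from collections import Counter


def classify_by_evalue(e_value: float) -> str:
    """Assign a probe to a homology class from its best BLAST E value."""
    if e_value < 1e-100:
        return "exact match"
    if e_value <= 1e-5:
        return "some homology"
    return "no significant homology"


# Hypothetical best-hit E values for five probes.
toy_evalues = [0.0, 3e-120, 1e-50, 2e-8, 0.4]
summary = Counter(classify_by_evalue(e) for e in toy_evalues)
print(summary)
```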

All of the probe sequences (as amplified) were
then analyzed for protein similarities with the SwissProt
database using BLASTX, Gish et al., Nature Genet. 3:266
(1993). The predicted functional breakdowns of the 2/3 of
probes identical or homologous to known sequences are
presented in Table 1.

Table 1

Function of Predicted ORFs As Deduced From Comparative
Sequence Analysis

Total   V6 chip   V7 chip   Function Predicted from Comparative Sequence Analysis
-----   -------   -------   -----------------------------------------------------
  211        96       115   Receptor
  120        43        77   Zinc Finger
   30        11        19   Homeobox
   25         9        16   Transcription Factor
   17        11         7   Transcription
  118        57        61   Structural
   95        39        56   Kinase
   36        18        18   Phosphatase
   83        31        52   Ribosomal
   45        19        26   Transport
   21        17        14   Growth Factor
   17        12         5   Cytochrome
   50        33        17   Channel

As can be seen, the two most common types of
genes were transcription factors and receptors, making up
2.2% and 1.8% of the arrayed elements, respectively.

EXAMPLE 2

Gene Expression Measurements From Genome-Derived Single
Exon Microarrays

The two genome-derived single exon microarrays
prepared according to Example 1 were hybridized in a series
of simultaneous two-color fluorescence experiments to (1)
Cy3-labeled cDNA synthesized from message drawn
individually from each of brain, heart, liver, fetal liver,
placenta, lung, bone marrow, HeLa, BT 474, or HBL 100
cells, and (2) Cy5-labeled cDNA prepared from message
pooled from all ten tissues and cell types, as a control in
each of the measurements. Hybridization and scanning were
carried out using standard protocols and Molecular Dynamics
equipment.

Briefly, mRNA samples were bought from commercial
sources (Clontech, Palo Alto, CA and Amersham Pharmacia
Biotech (APB)). Cy3-dCTP and Cy5-dCTP (both from APB) were
incorporated during separate reverse transcriptions of 1 ug
of polyA+ mRNA performed using 1 ug oligo(dT)12-18 primer
and 2 ug random 9mer primers as follows. After heating to
70°C, the RNA:primer mixture was snap cooled on ice. The
following were then added to the RNA to the stated final
concentration: 1X Superscript II buffer, 0.01 M DTT,
100 uM dATP, 100 uM dGTP, 100 uM dTTP, 50 uM dCTP, 50 uM
Cy3-dCTP or Cy5-dCTP, and 200 U Superscript II
enzyme. The reaction was incubated for 2 hours at 42°C.
After 2 hours, the first strand cDNA was isolated by adding
1 U Ribonuclease H, and incubating for 30 minutes at 37°C.
The reaction was then purified using a Qiagen PCR cleanup
column, increasing the number of ethanol washes to 5.
Probe was eluted using 10 mM Tris pH 8.5.

Using a spectrophotometer, probes were measured
for dye incorporation. Volumes of both Cy3 and Cy5 cDNA
corresponding to 50 pmoles of each dye were then dried in a
Speedvac, resuspended in 30 ul hybridization solution
containing 50% formamide, 5X SSC, 0.2 ug/ul poly(dA), 0.2
ug/ul human Cot-1 DNA, and 0.5% SDS.
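
The volume of labeled cDNA corresponding to 50 pmoles of dye can
be estimated from an absorbance reading by the Beer-Lambert
relation. The sketch below is illustrative only. The molar
extinction coefficients (150,000 per M per cm for Cy3 near 550 nm
and 250,000 per M per cm for Cy5 near 650 nm) are commonly cited
values assumed here, not values stated in this description, and
the 1 cm path length and absence of a dilution factor are further
assumptions.

```python
# Assumed molar extinction coefficients, in M^-1 cm^-1.
EXTINCTION = {"Cy3": 150_000.0, "Cy5": 250_000.0}


def dye_pmol_per_ul(absorbance: float, dye: str,
                    path_cm: float = 1.0) -> float:
    """Dye concentration in pmol/ul from absorbance at the dye maximum."""
    molar = absorbance / (EXTINCTION[dye] * path_cm)  # mol per liter
    return molar * 1e6  # 1 mol/L equals 1e6 pmol/ul


def volume_for_pmol(target_pmol: float, conc_pmol_per_ul: float) -> float:
    """Volume in ul containing the requested amount of dye."""
    return target_pmol / conc_pmol_per_ul


# Hypothetical reading: A550 of 0.30 for a Cy3-labeled sample.
cy3_conc = dye_pmol_per_ul(0.30, "Cy3")
print(cy3_conc, volume_for_pmol(50, cy3_conc))
```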

Hybridizations were carried out under a
coverslip, with the array placed in a humid oven at 42°C
overnight. Before scanning, slides were washed in 1X SSC,
0.2% SDS at 55°C for 5 minutes, followed by 0.1X SSC, 0.2%
SDS, at 55°C for 20 minutes. Slides were briefly dipped in
water and dried thoroughly under a gentle stream of
nitrogen.

Slides were scanned using a Molecular Dynamics
Gen3 scanner, as described in Schena (ed.), Microarray
Biochip: Tools and Technology, Eaton Publishing
Company/BioTechniques Books Division (2000) (ISBN:
1881299376).

Although the use of pooled cDNA as a reference
permitted the survey of a large number of tissues, it
attenuates the measurement of relative gene expression,
since every highly expressed gene in the tissue/cell
type-specific fluorescence channel will be present to a level
of at least 10% in the control channel. Because of this fact,
both signal and expression ratios (the latter hereinafter,
"expression" or "relative expression") for each probe were
normalized using the average ratio or average signal,
respectively, as measured across the whole slide.

Data were accepted for further analysis only when
signal was at least three times greater than biological
noise, the latter defined by the average signal produced by
the E. coli control genes.
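
The normalization and acceptance steps can be sketched as
follows, assuming signals are held as plain lists of floats for
one slide. The function names and the toy values are
illustrative assumptions; the scanner software and any
additional background correction are not represented.

```python
from statistics import mean
from typing import List


def normalize_to_slide_average(values: List[float]) -> List[float]:
    """Divide each signal (or ratio) by the average across the slide."""
    slide_average = mean(values)
    return [v / slide_average for v in values]


def passes_noise_filter(signal: float, control_signals: List[float],
                        fold: float = 3.0) -> bool:
    """Accept a probe whose signal is at least `fold` times the average
    signal of the E. coli control spots (the biological noise)."""
    return signal >= fold * mean(control_signals)


# Hypothetical values, for illustration only.
print(normalize_to_slide_average([1.0, 2.0, 3.0]))
print(passes_noise_filter(0.7, [0.2, 0.2]))
```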

The relative expression signal for these probes
was then plotted as a function of tissue or cell type, and is
presented in FIG. 6.

FIG. 6 shows the distribution of expression
across a panel of ten tissues. The graph shows the number
of sequence-verified products that were either not
expressed ("0"), expressed in one or more but not all
tested tissues ("1" - "9"), or expressed in all tissues
tested ("10").

Of 9999 arrayed elements on the two microarrays
(including positive and negative controls and "failed"
products), 2353 (51%) were expressed in at least one tissue
or cell type. Of the gene elements showing significant
signal (where expression was scored as "significant" if
the normalized Cy3 signal was greater than 1, representing
signal 5-fold over biological noise (0.2)), 39% (991) were
expressed in all 10 tissues. The next most common class
(15%) consisted of gene elements expressed in only a single
tissue.
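
A tally of the kind summarized in FIG. 6 can be produced by
counting, for each probe, the tissues in which the normalized Cy3
signal exceeds the significance threshold of 1. The sketch below
assumes a mapping from probe name to per-tissue normalized
signal; the probe names and values are hypothetical.

```python
from collections import Counter
from typing import Dict

Signals = Dict[str, float]  # tissue name -> normalized Cy3 signal


def tissues_expressed(signals: Signals, threshold: float = 1.0) -> int:
    """Number of tissues in which the probe shows significant signal."""
    return sum(1 for value in signals.values() if value > threshold)


def expression_histogram(probes: Dict[str, Signals],
                         threshold: float = 1.0) -> Counter:
    """Count probes by the number of tissues in which each is expressed."""
    return Counter(tissues_expressed(s, threshold) for s in probes.values())


# Hypothetical two-tissue example.
toy_probes = {
    "probe_A": {"brain": 2.0, "heart": 0.3},
    "probe_B": {"brain": 1.5, "heart": 1.2},
    "probe_C": {"brain": 0.1, "heart": 0.2},
}
toy_histogram = expression_histogram(toy_probes)
print(sorted(toy_histogram.items()))
```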

The genes expressed in a single tissue were 
further analyzed, and the results of the analyses are 
compiled in FIG. 7. 

FIG. 7A is a matrix presenting the expression of
all verified sequences that showed expression greater than
3 in at least one tissue. Each clone is represented by a
column in the matrix. Each of the 10 tissues assayed is
represented by a separate row in the matrix, and relative
expression of a clone in that tissue is indicated at the
respective node by intensity of green shading, with the
intensity legend shown in panel B. The top row of the
matrix ("EST Hit") contains "bioinformatic" rather than
"physical" expression data; that is, it presents the results
returned by query of EST, NR and SwissProt databases using
the probe sequence. The legend for "bioinformatic
expression" (i.e., degree of homology returned) is
presented in panel C. Briefly, white is known, black is
novel, with gray depicting nonidentical with significant
homology (white: E values < 1e-100; gray: E values from
1e-05 to 1e-99; black: E values > 1e-05).

As FIG. 7 readily shows, heart and brain were
demonstrated to have the greatest numbers of genes that
were shown to be uniquely expressed in the respective
tissue. In brain, 200 uniquely expressed genes were
identified; in heart, 150. The remaining tissues gave the
following figures for uniquely expressed genes: liver, 100;
lung, 70; fetal liver, 150; bone marrow, 75; placenta, 100;
HeLa, 50; HBL, 100; and BT474, 50.

It was further observed that there were many more
"novel" genes among those that were up-regulated in only
one tissue, as compared with those that were down-regulated
in only one tissue. In fact, it was found that ORFs whose
expression was measurable in only a single one of the tested
tissues were represented in sequencing databases at a rate
of only 11%, whereas 36% of the ORFs whose expression was
measurable in 9 of the tissues were present in public
databases. As for those ORFs expressed in all ten tissues,
fully 45% were present in existing expressed sequence
databases. These results are not unexpected, since genes
expressed in a greater number of tissues have a higher
likelihood of being, and thus of having been, discovered by
EST approaches.

Comparison of Signal from Known and Unknown Genes

The normalized signal of the genes found to have
high homology to genes present in the GenBank human EST
database was compared to the normalized signal of those
genes not found in the GenBank human EST database. The
data are shown in FIG. 8.

FIG. 8 shows the normalized Cy3 signal intensity
for all sequence-verified products with a BLAST Expect
("E") value of greater than 1e-30 (designated "unknown")
upon query of existing EST, NR and SwissProt databases, and
shows in blue the normalized Cy3 signal intensity for all
sequence-verified products with a BLAST Expect value of
less than 1e-30 ("known"). Note that biological background
noise has an averaged normalized Cy3 signal intensity of
0.2.

As expected, the most highly expressed of the
ORFs were "known" genes. This is not surprising, since
very high signal intensity correlates with very
commonly-expressed genes, which have a higher likelihood of
being found by EST sequencing.

However, a significant point is that a large
number of even the high expressers were "unknown". Since
the genomic approach used to identify genes and to confirm
their expression does not bias exons toward either the 3'
or 5' end of a gene, many of these high expression genes
will not have been detected in an end-sequenced cDNA
library.

The significant point is that presence of the
gene in an EST database is not a prerequisite for
incorporation into a genome-derived microarray, and
further, that arraying such "unknown" exons can help to
assign function to as-yet undiscovered genes.

Verification of Gene Expression 

To ascertain the validity of the approach
described above to identify genes from raw genomic
sequence, expression of two of the probes was assayed using
reverse transcriptase polymerase chain reaction (RT PCR)
and northern blot analysis.

Two microarray probes were selected on the basis
of exon size, prior sequencing success, and tissue-specific
gene expression patterns as measured by the microarray
experiments. The primers originally used to amplify the
two respective ORFs from genomic DNA were used in RT PCR
against a panel of tissue-specific cDNAs (Rapid-Scan gene
expression panel 24 human cDNAs) (OriGene Technologies,
Inc., Rockville, MD).

Sequence AL079300_1 was shown by microarray
hybridization to be present in cardiac tissue, and sequence
AL031734_1 was shown by microarray experiment to be present
in placental tissue (data not shown). RT-PCR on these two
sequences confirmed the tissue-specific gene expression as
measured by microarrays, as ascertained by the presence of
a correctly sized PCR product from the respective tissue
type cDNAs.

Clearly, not all microarray results can, or indeed should,
be confirmed by independent assay methods, or the high
throughput, highly parallel advantages of microarray
hybridization assays will be lost. However, in addition to
the two RT-PCR results presented above, the observation
that 1/3 of the arrayed genes exist in expression databases
provides strong confirmation of the power of our
methodology, which combines bioinformatic prediction with
expression confirmation using genome-derived single exon
microarrays, to identify novel genes from raw genomic data.

To verify that the approach further provides 
correct characterization of the expression patterns of the 
identified genes, a detailed analysis was performed of the 
microarrayed sequences that showed high signal in brain. 

For this latter analysis, sequences that showed high
(normalized) signal in brain, but which showed very low
(normalized) signal (less than 0.5, determined to be
biological noise) in all other tissues, were further
studied. There were 82 sequences that fit these criteria,
approximately 2% of the arrayed elements. The 10 sequences
showing the highest signal in brain in microarray
hybridizations are detailed in Table 2, along with assigned
function, if known or reasonably predicted.

Table 2
Function of the Most Highly Expressed Genes Expressed Only in Brain

Microarray   Normalized  Expression  Homology to  Gene Function
Sequence     Signal      Ratio       EST present  as described by GenBank
Name                                 in GenBank

AP000217-1   5.2         +7.7        High         S-100 protein, b-chain, Ca2+
                                                  binding protein expressed in
                                                  central nervous system
AP000047-1   2.3                     High         Unknown function
AC006548-9   1.7                     High         Similar to mouse membrane
                                                  glycoprotein M6, expressed in
                                                  central nervous system
AC007245-5   1.5                     High         Similar to amphiphysin, a
                                                  synaptic vesicle-associated
                                                  protein. Ref 21
L44140-4     1.2         +2.0        High         Endothelial actin-binding
                                                  protein found in nonmuscle
                                                  filamin
AC004689-9   1.2         +3.5        High         Protein phosphatase PP2A,
                                                  neuronal/downregulates
                                                  activated protein kinases
AL031657-1   1.2         +3.0        High         Unknown function/contains the
                                                  ankyrin motif, a common
                                                  protein sequence motif
AC009266-2   1.1         +3.7        Low          Low homology to the
                                                  synaptotagmin I protein in
                                                  rat/present at low levels
                                                  throughout rat
AP000086-1   1.0         +2.7        Low          Unknown, very poor homology
                                                  to collagen
AC004689-3   1.0                     High         Protein phosphatase PP2A,
                                                  neuronal/downregulates
                                                  activated protein kinases

Of the ten sequences studied by these latter confirmatory
approaches, eight were previously known. Of these eight,
six had previously been reported to be important in the
central nervous system or brain. The exon giving the
highest signal (AP000217-1) was found to be the gene
encoding an S100B Ca2+ binding protein, reported in the
literature to be highly and uniquely expressed in the
central nervous system. Heizmann, Neurochem. Res. 9:1097
(1997).

A number of the brain-specific probe sequences (including
AC006548-9, AC009266-2) did not have homology to any known
human cDNAs in GenBank but did show homology to rat and
mouse cDNAs. Sequences AC004689-9 and AC004689-3 were both
found to be phosphatases present in neurons (Millward et
al., Trends Biochem. Sci. 24(5):186-191 (1999)). Two
microarray sequences, AP000047-1 and AP000086-1, have
unknown function, with AP000086-1 being absent from
GenBank. Functionality can now be narrowed down to a role
in the central nervous system for both of these genes,
showing the power of designing microarrays in this fashion.

Next, the function of the chip sequences with the highest
(normalized) signal intensity in brain, regardless of
expression in other tissues, was assessed. In this latter
analysis, we found expression of many more common genes,
since the sequences were not limited to those expressed
only in brain. For example, looking at the 20 highest
signal intensity spots in brain, 4 were similar to tubulin
(AC00807905; AF146191-2; AC007664-4; AF14191-2), 2 were
similar to actin (AL035701-2; AL034402-1), and 6 were found
to be homologous to glyceraldehyde-3-phosphate
dehydrogenase (GAPDH) (AL035604-1; Z86090-1; AC006064-L,
AC006064-K; AC035604-3; AC006064-L). These genes are often
used as controls or housekeeping genes in microarray
experiments of all types.

Other interesting genes highly expressed in brain included
one encoding a ferritin heavy chain protein, which is
reported in the literature to be found in brain and liver
(Joshi et al., J. Neurol. Sci. 134 (Suppl):52-56 (1995)), a
result duplicated with the array. Other highly expressed
chip sequences included a translation elongation factor 10
(AC007564-4), a DEAD-box homolog (AL023804-4), and a
Y-chromosome RNA-binding motif (Chai et al., Genomics
49(2):283-89 (1998)) (AC007320-3). A low homology analog
(AP00123-1/2) to a gene, DSCR1, thought to be involved in
trisomy 21 (Down's syndrome), showed high expression in
both brain and heart, in agreement with the literature
(Fuentes et al., Mol. Genet. 4(10):1935-44 (1995)).

As a further validation of the approach, we selected the
BAC AC006064 to be included on the array. This BAC was
known to contain the GAPDH gene, and thus could be used as
a control for the ORF selection process. The gene finding
and exon selection algorithms resulted in choosing 25 exons
from BAC AC006064 for spotting onto the array, of which
four were drawn from the GAPDH gene. Table 3 compares the
average expression ratio for the 4 exons from BAC AC006064
with the average expression ratio for 5 different dilutions
of a commercially available GAPDH cDNA (Clontech).



Table 3
Comparison of Expression Ratio, for each tissue, of GAPDH

Tissue        AC006064 (n = 4)   Control (n = 5)

Bone Marrow   -1.81 ± 0.11       -1.85 ± 0.08
Brain         -1.41 ± 0.11       -1.17 ± 0.05
BT474          1.85 ± 0.09        1.66 ± 0.12
Fetal Liver   -1.62 ± 0.07       -1.41 ± 0.05
HBL100         1.32 ± 0.05        2.64 ± 0.12
Heart          1.16 ± 0.09        1.56 ± 0.10
HeLa           1.11 ± 0.06        1.30 ± 0.15
Liver         -1.62 ± 0.22       -2.07 ±
Lung          -4.95 ± 0.93       -3.75 ± 0.21
Placenta      -3.56 ± 0.25       -3.52 ± 0.43



Each tissue shows excellent agreement between the
experimentally chosen exons and the control, again
demonstrating the validity of the present exon mining
approach. In addition, the data also show the variability
of expression of GAPDH within tissues, calling into
question its classification as a housekeeping gene and
utility as a housekeeping control in microarray
experiments.

EXAMPLE 3 

Representation of Sequence and Expression Data as a 
"Mondrian" 


For each genomic clone processed for microarray as
above-described, a plethora of information was accumulated,
including full clone sequence, probe sequence within the
clone, results of each of the three gene finding programs,
EST information associated with the probe sequences, and
microarray signal and expression for multiple tissues,
challenging our ability to display the information.

Accordingly, we devised a new tool for visual display of
the sequence with its attendant annotation which, in
deference to its visual similarity to the paintings of Piet
Mondrian, is hereinafter termed a "Mondrian". FIGS. 3 and 4
present the key to the information presented on a Mondrian.

FIG. 9 presents a Mondrian of BAC AC008172 (bases 25,000 to
130,000 shown), containing the carbamyl phosphate
synthetase gene (AF154830.1). Purple background within the
region shown as field 81 in FIG. 3 indicates all 37 known
exons for this gene.

As can be seen, GRAIL II successfully identified 27 of the
known exons (73%), GENEFINDER successfully identified 37 of
the known exons (100%), while DICTION identified 7 of the
known exons (19%).

Seven of the predicted exons were selected for physical
assay, of which 5 successfully amplified by PCR and were
sequenced. These five exons were all found to be from the
same gene, the carbamyl phosphate synthetase gene
(AF154830.1).

The five exons were arrayed, and gene expression measured
across 10 tissues. As is readily seen in the Mondrian, the
five chip sequences on the array show identical expression
patterns, elegantly demonstrating the reproducibility of
the system.

FIG. 10 is a Mondrian of BAC AL049839. We selected 12 exons
from this BAC, of which 10 were successfully sequenced and
were found to form between 5 and 6 genes. Interestingly, 4
of the genes on this BAC are protease inhibitors. Again,
these data elegantly show that exons selected from the same
gene show the same expression patterns, depicted below the
red line. From this figure, it is clear that our ability to
find known genes is very good. A novel gene is also found
from 86.6 kb to 88.6 kb, upon which all the exon finding
programs agree. We are confident we have two exons from a
single gene since they show the same expression patterns
and the exons are proximal to each other. Backgrounds in
the following colors indicate a known gene (top to bottom):
red = kallistatin protease inhibitor (P29622);
purple = plasma serine protease inhibitor (P05154);
turquoise = α1 anti-chymotrypsin (P01011); mauve = 40S
ribosomal protein (P08865). Note that chip sequences 8 and
12 did not sequence verify.

EXAMPLE 4

Genome-Derived Single Exon Probes Useful For Measuring 
Human Gene Expression 


The protocols set forth in Examples 1 and 2, supra, were
applied to additional human genomic sequence as it became
newly available in GenBank to identify unique exons in the
human genome that could be shown to be expressed at
significant levels in liver tissue.

These unique exons are within longer probe sequences. Each
probe was completely sequenced on both strands prior to its
use on a genome-derived single exon microarray; sequencing
confirms the exact chemical structure of each probe. An
added benefit of sequencing is that it placed us in
possession of a set of single base-incremented fragments of
the sequenced nucleic acid, starting from the sequencing
primer 3' OH. (Since the single exon probes were first
obtained by PCR amplification from genomic DNA, we were of
course additionally in possession of an even larger set of
single base-incremented fragments of each of the 13,109
single exon probes, each fragment corresponding to an
extension product from one of the two amplification
primers.)

The structures of the 13,109 unique single exon probes are
clearly presented in the Sequence Listing as SEQ ID NOs.:
1 - 13,109. The 16 nt 5' primer sequence and 16 nt 3'
primer sequence present on the amplicon are not included in
the Sequence Listing. The sequences of the exons present
within each of these probes are presented in the Sequence
Listing as SEQ ID NOs.: 13,110 - 25,995, respectively. It
will be noted that some amplicons have more than one exon,
and some exons are contained in more than one amplicon.

As detailed in Example 2, expression was demonstrated by
disposing the amplicons as single exon probes on nucleic
acid microarrays and then performing two-color fluorescent
hybridization analysis; significant expression is based on
a statistical confidence that the signal is significantly
greater than negative biological control spots. The
negative biological control is formed from spotted DNA
sequences from a different species. Here, 32 sequences from
E. coli were spotted in duplicate to give a total of 64
spots.

For each hybridisation (each slide, each colour) the median
value of the signal from all of the spots is determined.
The normalised signal value is the arithmetic mean of the
signal from duplicate spots divided by the population
median.

Control spots are eliminated if there is more than a
five-fold difference between the raw signals of the two
duplicate spots.

The median of the signal from the remaining control spots
is calculated and all subsequent calculations are done with
normalised signals.

Control spots having a signal of greater than median + 2.4
(the value 2.4 is roughly 12 times the observed standard
deviation of control spot populations) are eliminated.
Spots with such high signals are considered to be
"outliers".

The mean and standard deviation of the modified 
control spot populations are calculated. 

The mean + 3x the standard deviation (mean + (3*SD)) is
used as the signal threshold qualifier for that particular
hybridisation. Thus, individual thresholds are determined
for each channel and each hybridisation.

This means that, assuming that the data is 
distributed normally, there is a 99% confidence that any 
signal exceeding the threshold is significant. 
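
The thresholding steps described above can be sketched in a
few lines of Python. This is only an illustrative sketch
under stated assumptions, not the software actually used:
the function name signal_threshold and its parameter names
are hypothetical, control pairs containing a non-positive
raw signal are simply discarded, and the sample standard
deviation is assumed because the text does not say which
estimator was used.

```python
from statistics import mean, median, stdev


def signal_threshold(all_spot_signals, control_duplicates,
                     max_fold=5.0, outlier_offset=2.4, n_sd=3.0):
    """Return the significance threshold, in normalised units, for one
    hybridisation and one colour channel (hypothetical helper)."""
    # Median raw signal over every spot on the slide for this channel.
    population_median = median(all_spot_signals)

    normalised = []
    for a, b in control_duplicates:
        lo, hi = min(a, b), max(a, b)
        # Eliminate control pairs whose raw duplicate signals differ more
        # than five-fold (assumption: non-positive signals are also dropped).
        if lo <= 0 or hi / lo > max_fold:
            continue
        # Normalised signal: mean of the duplicates over the population median.
        normalised.append(((a + b) / 2.0) / population_median)

    # Eliminate outliers lying above (median of the controls + 2.4).
    control_median = median(normalised)
    kept = [s for s in normalised if s <= control_median + outlier_offset]

    # Threshold = mean + 3 x SD of the trimmed control population
    # (assumption: sample standard deviation).
    return mean(kept) + n_sd * stdev(kept)
```

A probe would then be scored as significantly expressed in
a given channel when its own normalised signal exceeds the
value returned for that hybridisation.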

The probes and their expression data are presented in
Table 4, set forth in Example 5. Example 5 presents the
subset of probes that is significantly expressed in the
human adult liver and thus presents the subset of probes
that was recognized to be useful for measuring expression
of their cognate genes in human adult liver tissue.

The sequence of each of the exon probes identified by SEQ
ID NOS.: 13,110 - 25,995 was individually used as a BLAST
(or, for SWISSPROT, BLASTX) query to identify the most
similar sequence in each of dbEST, SwissProt (BLASTX), and
NR divisions of GenBank. Because the query sequences are
themselves derived from genomic sequence in GenBank, only
nongenomic hits from NR were scored.

The smallest in value of the BLAST (or BLASTX) expect ("E")
scores for each query sequence across the three database
divisions was used as a measure of the "expression novelty"
of the probe's ORF. Table 4 is sorted in descending order
based on this measure, reported as "Most Similar (top) Hit
BLAST E Value". Those sequences for which no "Hit E Value"
is listed are those exons which were found to have no
similar sequences.

As sorted, Table 4 thus lists its respective probes (by
"AMPLICON SEQ ID NO.:" and additionally by the SEQ ID NO.:
of the exon contained within the probe: "EXON SEQ ID NO.:")
from least similar to sequences known to be expressed
(i.e., highest BLAST E value), at the beginning of the
table, to most similar to sequences known to be expressed
(i.e., lowest BLAST E value), at the bottom of the table.
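
The "expression novelty" measure and the resulting sort
order can be illustrated with a short Python sketch. The
record layout (a dict with an "e_values" entry keyed by
database name) and the function names are hypothetical, and
placing probes with no hit at the top of the list is an
assumption; the text states only that such probes have no
E value listed.

```python
import math


def expression_novelty(e_values):
    """Smallest BLAST E value across the queried database divisions,
    or None when no division returned a hit."""
    hits = [e for e in e_values.values() if e is not None]
    return min(hits) if hits else None


def sort_by_novelty(probes):
    """Order probes from least similar to known expressed sequences
    (highest E value) to most similar (lowest E value)."""
    def key(probe):
        novelty = expression_novelty(probe["e_values"])
        # Assumption: a probe with no hit at all is treated as the most
        # novel and therefore sorts ahead of every probe that has a hit.
        return math.inf if novelty is None else novelty
    return sorted(probes, key=key, reverse=True)
```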

Table 4 further provides, for each listed probe, the
accession number of the database sequence that yielded the
"Most Similar (top) Hit BLAST E Value", along with the name
of the database in which the database sequence is found
("Top Hit Database Source").
Table 4 further provides SEQ ID NOS. corresponding to the
predicted amino acid sequences where they have been
determined for the probe and exon nucleotide sequences.
These are set out as PEPTIDE SEQ ID NOS.:. The peptide
sequences for a given exon are predicted as follows: Since
each chip exon is a consensus sequence drawn from
predictions from various exon finding programs (i.e.,
Grail, GeneFinder and GenScan), the multiple initial ORFs
are first determined in a uniform way according to each
prediction. In particular, the reading frame for predicting
the first amino acid in the peptide sequence always starts
with the first base of any codon and ends with the last
base of a non-termination codon. Next, for each strand of
the exon, initial ORFs are merged into one or more final
ORFs in an exhaustive process based on the following
criteria: 1) the merging ORFs must be overlapping, and 2)
the merging ORFs must be in the same frame.
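
The merging step can be pictured with the following Python
sketch, offered only as an illustration under stated
assumptions. The function name merge_orfs is hypothetical;
each initial ORF on one strand is assumed to be a
(start, end) pair of 0-based, end-exclusive coordinates
whose length is a multiple of three, so that two ORFs share
a frame exactly when their start positions agree modulo 3.

```python
from collections import defaultdict


def merge_orfs(initial_orfs):
    """Merge initial ORFs on one strand into final ORFs.

    Two ORFs are merged only if they overlap and lie in the same reading
    frame; merging repeats until no further merge is possible.
    """
    # Group ORFs by reading frame (start position modulo 3).
    by_frame = defaultdict(list)
    for start, end in initial_orfs:
        by_frame[start % 3].append((start, end))

    merged = []
    for frame_orfs in by_frame.values():
        # A sweep over start-sorted intervals merges every overlapping
        # chain, which is equivalent to merging pairs exhaustively.
        frame_orfs.sort()
        cur_start, cur_end = frame_orfs[0]
        for start, end in frame_orfs[1:]:
            if start < cur_end:
                # Overlap within the same frame: extend the current ORF.
                cur_end = max(cur_end, end)
            else:
                merged.append((cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((cur_start, cur_end))
    return sorted(merged)
```

In this sketch ORFs that merely abut (one ends exactly
where the next begins) are not treated as overlapping; that
choice is also an assumption.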

The Sequence Listing, which is a superset of all of the
data presented in Table 4, further includes, for each
probe, the most similar hit, with accession number and
BLAST E value, from each of the three queried databases.

Table 4 further lists, for each probe, a portion of the
descriptor for the top hit ("Top Hit Descriptor") as
provided in the sequence database. For those ORFs that are
similar in sequence, but nonidentical to known sequences
(e.g., those with BLAST E values between about 1e-05 and
1e-100), the descriptor reveals the likely function of the
protein encoded by the probe's ORF.

Using BLAST E value cutoffs of 1e-05 (i.e., 1 x 10^-5) and
1e-100 (i.e., 1 x 10^-100) as evidence of similarity to
sequences known to be expressed is of course arbitrary: in
Example 2, supra, a BLAST E value of 1e-30 was used as the
boundary when only two classes were to be defined for
analysis (unknown, >1e-30; known, <1e-30) (see also FIG.
8). Furthermore, even when the "Most Similar (Top) Hit
BLAST E Value" is low, e.g., less than about 1e-100, which
is probative evidence that the query sequence has
previously been shown to be expressed, the top hit is
highly unlikely exactly to match the probe sequence.

First, such expression entries typically will not have the
intronic and/or intergenic sequence present within the
single exon probes listed in the Table. Second, even the
ORF itself is unlikely in such cases to be present
identically in the databases, since most of the EST and
mRNA clones in existing databases include multiple exons,
without any indication of the location of exon boundaries.

As noted, the data presented in Table 4 represent a proper
subset of the data present within the attached Sequence
Listing. For each amplicon probe (SEQ ID NOs.: 1 - 13,109)
and probe exon (SEQ ID NOs.: 13,110 - 25,995,
respectively), the Sequence Listing further provides,
through iterated annotation fields <220> and <223>:

(a) the accession number of the BAC from which the sequence
was derived ("MAP TO"), thus providing a link to the
chromosomal map location and other information about the
genomic milieu of the probe sequence;

(b) the most similar sequence provided by BLAST query of
the EST database, with accession number and BLAST E value
for the "hit";

(c) the most similar sequence provided by BLAST query of
the GenBank NR database, with accession number and BLAST E
value for the "hit"; and

(d) the most similar sequence provided by BLASTX query of
the SWISSPROT database, with accession number and BLAST E
value for the "hit".

Genome-Derived Single Exon Probes Useful For Measuring
Expression of Genes in Human Adult Liver

Table 4 (545 pages) presents expression, homology, and
functional information for the genome-derived single exon
probes that are expressed significantly in human adult
liver.

Top Hit Descriptor

D.rerio zp-50 POU gene
Homo sapiens carcinoembryonic antigen-related cell adhesion molecule 1 (biliary glycoprotein) (CEACAM1), mRNA
SQUALENE-HOPENE CYCLASE
SQUALENE-HOPENE CYCLASE
PHOSPHOGLYCERATE KINASE, CYTOSOLIC
PHOSPHOGLYCERATE KINASE, CYTOSOLIC
NADH-UBIQUINONE OXIDOREDUCTASE CHAIN 4
NADH-UBIQUINONE OXIDOREDUCTASE CHAIN 4
VON WILLEBRAND FACTOR PRECURSOR (VWF)
Chlamydomonas reinhardtii chloroplast DNA for rps9, ycf4, ycf3, rps18 genes
PERIPLASMIC [NIFE] HYDROGENASE SMALL SUBUNIT (NIFE HYDROGENLYASE SMALL CHAIN)
S.cerevisiae threonine deaminase (ILV1) gene, complete cds
Oryzias latipes OIGCS gene for guanylyl cyclase C, complete cds
Sus scrofa choline acetyltransferase gene, promoter region
HYPOTHETICAL 142.5 KD PROTEIN C23E2.02 IN CHROMOSOME I
TRIOSE PHOSPHATE/PHOSPHATE TRANSLOCATOR, NON-GREEN PLASTID PRECURSOR (CTPT)
Bacillus alcalophilus pectate lyase (pelE) gene, complete cds
PROBABLE UBIQUITIN-PROTEIN LIGASE HUL4
TYPE 1 IODOTHYRONINE DEIODINASE (TYPE-I 5'DEIODINASE) (DIOI) (TYPE 1 DI) (5DI)
TYPE 1 IODOTHYRONINE DEIODINASE (TYPE-I 5'DEIODINASE) (DIOI) (TYPE 1 DI) (5DI)
GLUTAMATE [NMDA] RECEPTOR SUBUNIT EPSILON 3 PRECURSOR (N-METHYL D-ASPARTATE RECEPTOR SUBTYPE 2C) (NR2C) (NMDAR2C)
COLLAGEN ALPHA 2(I) CHAIN PRECURSOR
Chlorella vulgaris chloroplast, complete genome
HYPOTHETICAL 56.3 KD PROTEIN F52C9.5 IN CHROMOSOME III
DEOXYHYPUSINE SYNTHASE (DHS)
GENOME POLYPROTEIN [CONTAINS: CAPSID PROTEIN C (CORE PROTEIN); MATRIX PROTEIN (ENVELOPE PROTEIN M); MAJOR ENVELOPE PROTEIN E; NONSTRUCTURAL PROTEINS NS1, NS2A, NS2B, NS4A AND NS4B; HELICASE (NS3); RNA-DIRECTED RNA POLYMERASE (NS5)]
retinoic acid nuclear receptor isoform beta 2 [mice, embryonal carcinoma cell line, PCC7-MZ1, mRNA, 2971 nt]
S.aureus genes encoding Sau96I DNA methyltransferase and Sau96I restriction endonuclease
Corynebacterium glutamicum thrC gene for threonine synthase (EC 4.2.99.2)
Corynebacterium glutamicum thrC gene for threonine synthase (EC 4.2.99.2)
ENDOTHELIAL CELL MULTIMERIN PRECURSOR
B.napus DNA for myrosinase
S-ADENOSYLMETHIONINE SYNTHETASE (METHIONINE ADENOSYLTRANSFERASE) (ADC
SYNTHETASE)
CDC10 PROTEIN HOMOLOG
RETINAL GUANYLYL CYCLASE 2 PRECURSOR (GUANYLATE CYCLASE 2F, RETINAL) (RE
(ROD OUTER SEGMENT MEMBRANE GUANYLATE CYCLASE 2) (ROS-GC2) (GUANYLATE
F)(GC-F)
RETINAL GUANYLYL CYCLASE 2 PRECURSOR (GUANYLATE CYCLASE 2F, RETINAL) (RE
(ROD OUTER SEGMENT MEMBRANE GUANYLATE CYCLASE 2) (ROS-GC2) (GUANYLATE
F)(GC-F)
NADH-UBIQUINONE OXIDOREDUCTASE CHAIN 4
Chlamydophila pneumoniae AR39, section 53 of 94 of the complete genome
F.pringlei gdcsPA gene for P-protein of the glycine cleavage system
BRAIN-SPECIFIC ANGIOGENESIS INHIBITOR 1 PRECURSOR
BRAIN-SPECIFIC ANGIOGENESIS INHIBITOR 1 PRECURSOR
ADHERENCE FACTOR (ADHESION AND AGGREGATION MEDIATING SURFACE ANTIGEN)
STRUCTURAL POLYPROTEIN [CONTAINS: MAJOR STRUCTURAL PROTEIN VP2; NONSTRUCTURAL PROTEIN VP4; MINOR STRUCTURAL PROTEIN VP3]
STRUCTURAL POLYPROTEIN [CONTAINS: MAJOR STRUCTURAL PROTEIN VP2; NONSTRUCTURAL PROTEIN VP4; MINOR STRUCTURAL PROTEIN VP3]
602017413F1 NCI_CGAP_Brn64 Homo sapiens cDNA clone IMAGE:4153059 5'
Saguinus oedipus gene for seminal vesicle secreted protein semenogelin 1
Buxus harlandii maturase K (matK) gene, partial cds; chloroplast gene for chloroplast product
Arabidopsis thaliana DNA chromosome 4, contig fragment No. 52
CTD-BINDING SR-LIKE PROTEIN RA4
COLLAGEN ALPHA 2(I) CHAIN PRECURSOR
COLLAGEN ALPHA 2(I) CHAIN PRECURSOR
CM3-MT0114-010900-323-M 2 MT0114 Homo sapiens cDNA
ARGININE DEIMINASE (ADI) (ARGININE DIHYDROLASE) (AD)
abS4aD4.s1 Stratagene lung (#937210) Homo sapiens cDNA clone IMAG
repetitive element; contains element L1 L1 repetitive element;
Homo sapiens gag-pro-pol precursor protein gene, partial cds
PROTEIN D8 PRECURSOR
Synechococcus sp. PCC7942 copper transporting P-ATPase (ctaA) and
(atpE) genes, complete cds
Synechococcus sp. PCC7942 copper transporting P-ATPase (ctaA) and
(atpE) genes, complete cds
HEDGEHOG RECEPTOR (PATCHED PROTEIN)
zd25f01.r1 Soares_fetal_heart_NbHH19W Homo sapiens cDNA clone IMAGE:341689 5' similar to gb:D29805 N-ACETYLLACTOSAMINE SYNTHASE (HUMAN);
Homo sapiens DNA, DLEC1 to ORCTL4 gene region, section 1/2 (DLEC1, ORCTL3, ORCTL4 genes, complete cds)
602186095T1 NIH_MGC_45 Homo sapiens cDNA clone IMAGE:4310591 3'
Homo sapiens proliferation-associated SNF2-like protein (SMARCA6) mRNA, complete cds
Saimiri boliviensis olfactory receptor (SB027) gene, partial cds
Brachydanio rerio MHC class II DA-beta-2*01 gene, 3' end
Homo sapiens transglutaminase type 1 (Tgasel) gene, promoter region
601651111R1 NIH_MGC_81 Homo sapiens cDNA clone IMAGE:3934443 3'
601651111R1 NIH_MGC_81 Homo sapiens cDNA clone IMAGE:3934443 3'
IL2-UT0073-060900-145-E02 UT0073 Homo sapiens cDNA
UI-H-BI2-ahr-b-D4-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2727511 3'
RCO-CT0415-200700-C32-C10 CT0415 Homo sapiens cDNA
VIRULENCE FACTOR MVIN HOMOLOG
Homo sapiens hypothetical protein PRO0971 (PRO0971), mRNA
Homo sapiens hypothetical protein PRO0971 (PRO0971), mRNA
M.musculus COL3A1 gene for collagen alpha-I
M.musculus COL3A1 gene for collagen alpha-I
Thermoanaerobacter ethanolicus D-xylose-binding protein (xylF) gene, complete cds
ph3b6_'9/1TV Outward Alu-primed hncDNA library Homo sapiens cDNA clone ph6b6 19/1TV
Drosophila melanogaster signal transducing adaptor protein (STAM), serine threonine kinase ial (IAL), and zinc finger protein (DNZ1) genes, complete cds
CAPSID PROTEIN P40 [CONTAINS: ASSEMBLIN (PROTEASE); CAPSID ASSEMBLY PROTEIN]
Homo sapiens transglutaminase type 1 (Tgasel) gene, promoter region
Homo sapiens unknown mRNA
AV764043 MDS Homo sapiens cDNA clone MDSDAH08 5'
Rattus norvegicus jun dimerization protein 2 (jdp-2) mRNA, complete cds
Chlamydophila pneumoniae AR39, section 32 of 94 of the complete genome
Mus musculus a disintegrin and metalloproteinase domain (ADAM) 15 (metargidin) (Adam15), mRNA
Potato virus A RNA complete genome, isolate U
Mus musculus T-cell lymphoma invasion and metastasis 1 (Tiam1), mRNA
Potato virus A RNA complete genome, isolate U
Deinococcus radiodurans R1 section 82 of 229 of the complete chromosome 1
tt12f10.x1 NCI_CGAP_GC6 Homo sapiens cDNA clone IMAGE:224D587 3' similar to TR:O00237 O00237 HKF-1.;
tt12f10.x1 NCI_CGAP_GC6 Homo sapiens cDNA clone IMAGE:224DS87 3' similar to TR:O00237 O00237 HKF-1.;
601502041F1 NIH_MGC_70 Homo sapiens cDNA clone IMAGE:3903659 5'
601478745F1 NIH_MGC_68 Homo sapiens cDNA clone IMAGE:3881555 5'
HYPOTHETICAL 118.4 KD PROTEIN IN BAT2-DAL5 INTERGENIC REGION PRECURSOR
HYPOTHETICAL 118.4 KD PROTEIN IN BAT2-DAL5 INTERGENIC REGION PRECURSOR
Homo sapiens WDR4 gene for WD repeat protein, complete cds
Mouse germline IgM chain gene, mu-delta region
LYSOSOMAL ALPHA-MANNOSIDASE PRECURSOR (MANNOSIDASE, ALPHA B) (LYSOSOMAL ACID ALPHA-MANNOSIDASE) (LAMAN)
wo85a07.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2462100 3'
Lactococcus lactis cremoris NCDO-inv1 chromosomal inversion junction DNA
Lactococcus lactis cremoris NCDO-inv1 chromosomal inversion junction DNA
Vibrio cholerae chromosome II, section 49 of 93 of the complete chromosome
Campylobacter jejuni kanamycin phosphotransferase (aphA-7) gene, complete cds
DIHYDROPYRIMIDINASE (DHPASE) (HYDANTOINASE) (DHP)
MRNA 3'-END PROCESSING PROTEIN RNA15
Bacillus subtilis genomic DNA 23.9kB fragment
Apple mosaic virus RNA 2 putative polymerase gene, complete cds
Cavia porcellus inwardly-rectifying potassium channel Kir2.2 (KCNJ12) gene, complete cds
602023185F1 NCI_CGAP_BrnS7 Homo sapiens cDNA clone IMAGE:4158452 5'
E1 GLYCOPROTEIN PRECURSOR (MATRIX GLYCOPROTEIN) (MEMBRANE GLYCOPROTEIN)
Sturnira lilium cytochrome b gene, complete cds; mitochondrial gene for mitochondrial product
Naphthalenesulfonate-degrading bacterium BN6 2,3-dihydroxybiphenyl dioxygenase (bphCII) gene, complete cds
zi22d08.s1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA clone IMAGE:431535 3'
HISTIDINE-RICH PROTEIN PRECURSOR (CLONE PFHRP-III)
HISTIDINE-RICH PROTEIN PRECURSOR (CLONE PFHRP-III)
HISTIDINE-RICH PROTEIN PRECURSOR (CLONE PFHRP-III)
Homo sapiens hypothetical protein PRO3077 (PRO3077), mRNA
Elaeis oleifera sesquiterpene synthase mRNA, complete cds
Homo sapiens G-protein coupled receptor 14 (GPR14) gene, complete cds
Homo sapiens post-synaptic density 95 (DLG4) gene, complete cds
Arabidopsis thaliana DNA chromosome 4, contig fragment No. 63
Arabidopsis thaliana DNA chromosome 4, contig fragment No. 33
e IMAGE:273599 3' similar to
transcript, (rRNA); gb:J04970
G.gallus T-cadherin mRNA, complete cds
Mus musculus subtilisin-like serine protease LPC (PC7) gene, era
MRO-FT0175-050900-203-g06_1 FT0175 Homo sapiens cDNA
Homo sapiens post-synaptic density 95 (DLG4) gene, complete cd
T.pinnatum chloroplast rbcL gene, partial
G.gallus T-cadherin mRNA, complete cds
HISTIDINE-RICH PROTEIN PRECURSOR (CLONE PFHRP-III)
HISTIDINE-RICH PROTEIN PRECURSOR (CLONE PFHRP-III)
HISTIDINE-RICH PROTEIN PRECURSOR (CLONE PFHRP-III)
Human extracellular calcium-sensing receptor mRNA, complete cc
MR3-ST0191-140200-013-C05 ST0191 Homo sapiens cDNA
Homo sapiens zinc finger protein ZNF191 (ZNF191) gene, comple
D.hydei ayl repeat cluster DNA, fragment D
C.glutamicum pta gene and ackA gene
C.glutamicum pta gene and ackA gene
yy39b12.s1 Soares melanocyte 2NbHM Homo sapiens cDNA clon
gb|M87935|HUMAALU472 Human carcinoma cell-derived Alu RN,
CARBOXYPEPTIDASE M PRECURSOR (HUMAN);
ECDYSONE-INDUCIBLE PROTEIN E75-A
MR3-ST0191-140200-013-C05 ST0191 Homo sapiens cDNA
zq38f05.r1 Stratagene hNT neuron (#937233) Homo sapiens cDN.
gb:D10522 Human mRNA for 80K-L protein, complete cds. (HUM,
Homo sapiens Xq pseudoautosomal region; segment 1/2
AV734585 cdA Homo sapiens cDNA clone cdAAFH03 5'
L.lactis pyrD and pyrF genes
Arabidopsis thaliana DNA, 24 kb surrounding PFL locus
Yersinia pseudotuberculosis psaE, psaF, adhesin (psaA), chaperone (psaB), and usher (psaC) genes, complete cds
Homo sapiens mRNA for KIAA0934 protein, partial cds
Mus musculus guanine nucleotide binding protein (G protein), gamma 3 subunit (Gng3), mRNA
DNA MISMATCH REPAIR PROTEIN MUTS
Homo sapiens KIAA0626 gene product (KIAA0626), mRNA
Klebsormidium fluitans cytochrome c oxidase subunit 2 (cox2) gene, mitochondrial gene encoding mitochondrial protein, partial cds
Homo sapiens potassium inwardly-rectifying channel, subfamily J, member 11 (KCNJ11), mRNA
Homo sapiens hypothetical protein FLJ11280 (FLJ11280), mRNA
Petroselinum crispum cytosolic glucose-6-phosphate dehydrogenase 1 (cG6PDH1) mRNA, complete cds
Petroselinum crispum cytosolic glucose-6-phosphate dehydrogenase 1 (cG6PDH1) mRNA, complete cds
wf76e11.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2361548 3'
Human PBI gene, complete cds
Human PBI gene, complete cds
LOW TEMPERATURE ESSENTIAL PROTEIN
Taenia solium immunogenic protein Ts76 mRNA, partial cds
Dictyostelium discoideum isopentenyl pyrophosphate isomerase (Dipi) mRNA, complete cds
Xenopus laevis rhodopsin gene, complete cds
Cavia cobaya mRNA for serine/threonine kinase, complete cds
Marchantia polymorpha genes for 26S rRNA, 5S rRNA, 18S rRNA, 5.8S rRNA and 26S rRNA
Girardia tigrina mRNA for homeodomain transcription factor (so gene)
Homo sapiens chromosome 21 segment HS21C018
Aedes aegypti mucin-like protein MUC1 mRNA, complete cds
DNA GYRASE SUBUNIT B
DNA GYRASE SUBUNIT B
af2SgC8.s1 Soares_total_fetus_Nb2HF8_9w Homo sapiens cDNA clone IMAGE:1032830 3' similar to WP:C42D8.3 CE04204; contains element MER22 MER22 repetitive element;
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Top Hit Descriptor 


3-OXO-5-ALPHA-STEROID 4-DEHYDROGENASE 1 (STEROID 5-ALPHA-REDUCTASE 1) (SF


HYPOTHETICAL 67.9 KD PROTEIN C6F12.08C IN CHROMOSOME I


af23g08.s1 Soares_total_fetus_Nb2HF8_9w Homo sapiens cDNA clone IMAGE:1032830 3' simila
WP:C42D8.3 CE04204 ;contains element MER22 MER22 repetitive element ;


Rattus norvegicus neuromedin U precursor (NmU) gene, exons 5 and 6


Xenopus laevis rhodopsin gene, complete cds


Agaricus bisporus mRNA for tyrosinase


Homo sapiens calcium channel alpha1E subunit (CACNA1E) gene, exons 7-49, and partial cds, at
spliced


Mus musculus dipeptidyl aminopeptidase-like protein 6 (Dpp6) gene, partial cds; and proximal Run-
inversion breakpoint


Hordeum vulgare gene encoding cysteine proteinase


Bos taurus micromolar calcium activated neutral protease 1 (CAPN1) gene, exons 11-20, and parti


Bos taurus micromolar calcium activated neutral protease 1 (CAPN1) gene, exons 11-20, and parti


Arabidopsis thaliana DNA chromosome 4, ESSA I FCA contig fragment No. 6


UI-H-BI3-alx-d-09-0-UI.s1 NCI_CGAP_Sub5 Homo sapiens cDNA clone IMAGE:3068969 3'


Mus musculus subtilisin-like serine protease LPC (PC7) gene, exons 1 to 9, partial cds


Homo sapiens cell cycle protein (PA2G4) gene, exons 2 through 5


SRB-11 PROTEIN


601581891F1 NIH_MGC_7 Homo sapiens cDNA clone IMAGE:3936382 5'


601581891F1 NIH_MGC_7 Homo sapiens cDNA clone IMAGE:3936382 5'


Human elastin (ELN) gene, partial cds, and LIM-kinase (LIMK1) gene, complete cds


insulin-like growth factor-binding protein 4 [cattle, pulmonary artery endothelial cells, mRNA, 2028 r


B-CELL RECEPTOR CD22 PRECURSOR (LEU-14) (B-LYMPHOCYTE CELL ADHESION MOLI
(BL-CAM)


Amanita muscaria mRNA for SCIII25 protein


CM4-HT0243-081199-037-e01 HT0243 Homo sapiens cDNA


S.cerevisiae MET, LEU4, and POL1 genes encoding MET4 protein, alpha-isopropylmalate (alpha-IPM)
synthetase (partial), and DNA polymerase alpha (partial)


601144885F2 NIH_MGC_19 Homo sapiens cDNA clone IMAGE:3160412 5'


hg77g11.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2951684 3'


hg77g11.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2951684 3'


Homo sapiens mRNA for KIAA0630 protein, partial cds


Homo sapiens thioredoxin-related protein mRNA, complete cds


Oncorhynchus tshawytscha isolate T-20 somatolactin precursor gene, exon 1


Oncorhynchus tshawytscha isolate T-20 somatolactin precursor gene, exon 1


MCKUSICK-KAUFMAN/BARDET-BIEDL SYNDROMES PUTATIVE CHAPERONIN


MCKUSICK-KAUFMAN/BARDET-BIEDL SYNDROMES PUTATIVE CHAPERONIN


Molluscum contagiosum virus type 1 ORF1 and ORF2 DNA


OVARIAN TUMOR LOCUS PROTEIN


yw14d02.r1 Soares_placenta_8to9weeks_2NbHP8to9W Homo sapiens cDNA clone IMAGE:252195 5'
similar to gb:M36072 60S RIBOSOMAL PROTEIN L7A (HUMAN);


Mus musculus mRNA for NIPSNAP2 protein


Mus musculus TANK binding kinase TBK1 (Tbk1) mRNA, complete cds


Homo sapiens MHC class I region


Homo sapiens MHC class I region


Drosophila melanogaster Na/K-ATPase beta subunit isoform 4 (JYbeta2) mRNA, complete cds


MELANOCYTE STIMULATING HORMONE RECEPTOR (MSH-R) (MELANOTROPIN RECEPTOR)
(MELANOCORTIN-1 RECEPTOR) (MC1-R)


Mus musculus putative collagen alpha-2 (XI) chain (COL11A2) gene, partial cds


NEURONAL MEMBRANE GLYCOPROTEIN M6-B


Drosophila melanogaster putative inorganic phosphate cotransporter (Picot) gene, partial cds; putative sodium
channel (Nach) and putative amylase-related protein (Amyrel) genes, complete cds; and putative serine-
enriched protein (gprs) gene, partial cds


Drosophila melanogaster putative inorganic phosphate cotransporter (Picot) gene, partial cds; putative sodium
channel (Nach) and putative amylase-related protein (Amyrel) genes, complete cds; and putative serine-
enriched protein (gprs) gene, partial cds


Bacillus halodurans genomic DNA, section 11/14


Bacillus halodurans genomic DNA, section 11/14


xn31h03.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2692469 3' similar to SW:LYAR_MOUSE
Q08288 CELL GROWTH REGULATING NUCLEOLAR PROTEIN. ;contains MER22.M PTR5 repetitive
element;


KK9872F Human fetal heart, Lambda ZAP Express Homo sapiens cDNA clone KK9872 5' similar to
EST(CLONEC-0PE11)


Thermotoga maritima section 23 of 136 of the complete genome


Staphylococcus aureus partial pta gene for phosphate acetyltransferase allele 15


602072473F1 NCI_CGAP_Brn67 Homo sapiens cDNA clone IMAGE:4215091 5'


Saimiri boliviensis olfactory receptor (SB027) gene, partial cds


Mus musculus gene for oviductal glycoprotein, complete cds


Neisseria meningitidis serogroup A strain Z2491 complete genome; segment 7/7


G.gallus mRNA for nicotinic acetylcholine receptor (nAChR) beta3 subunit


MR3-OT0007-0e0300-104-b02 OT0007 Homo sapiens cDNA


Neisseria meningitidis serogroup B strain MC58 section 50 of 206 of the complete genome


Homo sapiens interferon-induced protein p78 (MX1) gene, complete cds


Homo sapiens chromosome 21 segment HS21C078


Chicken (White Leghorn) delta-1 and delta-2 crystallin genes, complete cds


hd/£d05.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2912457 3' similar to c
repetitive element;contains L1.t2 L1 repetitive element ;


Homo sapiens chromosome 12 open reading frame 4 (C12ORF4), mRNA


Homo sapiens chromosome 12 open reading frame 4 (C12ORF4), mRNA


Rabbit mRNA for fast skeletal muscle myosin heavy chain (MHC)


Homo sapiens partial LIMD1 gene for LIM domains containing protein 1 and KIAA0851 gene


Homo sapiens partial LIMD1 gene for LIM domains containing protein 1 and KIAA0851 gene


hg33f02.x1 NCI_CGAP_GC6 Homo sapiens cDNA clone IMAGE:2947419 3'


hg33f02.x1 NCI_CGAP_GC6 Homo sapiens cDNA clone IMAGE:2947419 3'


Mus musculus ribosomal protein S19 (Rps19) gene, complete cds


Rattus norvegicus synaptic vesicle protein (SV2) mRNA, complete cds


Rattus norvegicus synaptic vesicle protein (SV2) mRNA, complete cds


PROTEIN-L-ISOASPARTATE O-METHYLTRANSFERASE (PROTEIN-BETA-ASPARTATE
METHYLTRANSFERASE) (PIMT) (PROTEIN L-ISOASPARTYL METHYLTRANSFERASE) (L-
ISOASPARTYL PROTEIN CARBOXYL METHYLTRANSFERASE)


Bacillus stearothermophilus beta-1,4-mannanase (manF), esterase (estA), transcript
alpha-galactosidase (galA) genes, complete cds


R.norvegicus mRNA for 3'UTR of ubiquitin-like protein


R.norvegicus mRNA for 3'UTR of ubiquitin-like protein


DYNAMIN


S5D1348090F1 NIH_MGC_55 Homo sapiens cDNA clone IMAGE:4078823 5'


601472768T1 NIH_MGC_68 Homo sapiens cDNA clone IMAGE:3875753 3'


601472768T1 NIH_MGC_68 Homo sapiens cDNA clone IMAGE:3875753 3'


CIRCUMSPOROZOITE PROTEIN (CS)


Flexibacter litoralis gyrB gene for DNA gyrase B subunit, partial cds


Flexibacter litoralis gyrB gene for DNA gyrase B subunit, partial cds


ty84h01.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2285809 3' simila
repetitive element;contains element L1 repetitive element ;


ty84h01.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2285809 3' simila
repetitive element;contains element L1 repetitive element ;


J2498F Human fetal heart, Lambda ZAP Express Homo sapiens cDNA clone J2498


602140372F1 NIH_MGC_46 Homo sapiens cDNA clone IMAGE:4301800 5'


601873281F1 NIH_MGC_54 Homo sapiens cDNA clone IMAGE:4097180 5'


AU126115 NT2RP1 Homo sapiens cDNA clone NT2RP1000130 5'


AU126115 NT2RP1 Homo sapiens cDNA clone NT2RP1000130 5'


MITOGEN-ACTIVATED PROTEIN KINASE KINASE KINASE 1 (MAPK/ERK KINA
KINASE 1) (MEKK 1)


CM3-ET0041-180500-187-d10 ET0041 Homo sapiens cDNA


CM3-ET0041-180500-187-d10 ET0041 Homo sapiens cDNA


za57h01.s1 Soares fetal lung NbHL19W Homo sapiens cDNA clone IMAGE:2976


RC4-TN0077-250800-011-g04 TN0077 Homo sapiens cDNA


Homo sapiens high-mobility group phosphoprotein (HMGI-C) gene, exons 1-3, compl


Top Hit
Database








HUMAN


HUMAN


HUMAN


EST


Top Hit Accession
No.


BF213873.1


AU126115.1


Most Similar 
(Top) Hit 
BLAST E 


3.3E-01


3.3E-01


3.3E-01


3.3E-01


3.3E-01


3.3E-01


Expression 
Signal 


12487


7539


5507




AV718037 FHTA Homo sapiens cDNA clone FHTAABH01 5'


Human mRNA for KIAA0361 gene, KIAA0361 protein


Homo sapiens partial LMO1 gene for LIM domain only 1 protein, exon 1


601855580F1 NIH_MGC_57 Homo sapiens cDNA clone IMAGE:4075627 5'


Deinococcus radiodurans R1 section 152 of 228 of the complete chromosome 1


Oryctolagus cuniculus Ig H-chain pseudogene, V-region (VH6-a2) gene, partial cds


Oryctolagus cuniculus Ig H-chain pseudogene, V-region (VH6-a2) gene, partial cds


Human monocyte antigen CD14 (CD14) mRNA, complete cds


Homo sapiens 6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase (PF2K) gene, exons 12 and 13


Homo sapiens 6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase (PF2K) gene, exons 12 and 13


601482059F1 NIH_MGC_38 Homo sapiens cDNA clone IMAGE:3884559


Rattus norvegicus A-kinase anchoring protein AKAP150 mRNA, complete c


Guira guira oocyte maturation factor Mos (c-mos) gene, partial cds


601148733F1 NIH_MGC_19 Homo sapiens cDNA clone IMAGE:3163688


Human mRNA for serine/threonine protein kinase, complete cds


3V1 CT0364 1 D200-065-b05 CT0364 Homo sapiens cDNA


Escherichia coli K-12 MG1655 section 384 of 400 of the complete genome


Escherichia coli K-12 MG1655 section 384 of 400 of the complete genome


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 65


B.taurus microsatellite (ETH121)


Pyrococcus horikoshii OT3 genomic DNA, 777001-994000 nt. position (4/7)


Borrelia burgdorferi (section 66 of 70) of the complete genome


Pseudomonas aeruginosa PAO1, section 11 of 529 of the complete genome


ov44g10.x1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:164D2S
repetitive element;contains element MER22 repetitive element ;


Mus musculus chromosome X contigA; putative Magea9 gene, Caltractin, N
and Zinc finger protein 185


RNA POLYMERASE BETA SUBUNIT (LARGE STRUCTURAL PROTEIN


Hepatitis G virus isolate 60 (SZNAE12) polyprotein precursor, gene, partial


Bovine adenovirus 3 complete genome


repetitive element;contains element LTR5 repetitive element ;


te32c02.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2
060392 R32184_3. ;


te32c02.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:
060392 R32184_3. ;


601880754F1 NIH_MGC_55 Homo sapiens cDNA clone IMAGE:4109350 5'


601852148F1 NIH_MGC_56 Homo sapiens cDNA clone IMAGE:4076026 5'


Drosophila heteroneura fruitless (fru) gene, alternative splice products, 5' flanking region, exons 1 through 7


yh21h11.r1 Soares placenta Nb2HP Homo sapiens cDNA clone IMAGE:130437 5' similar to contains LTR3
repetitive element;


Mus musculus DNA for prostaglandin D2 synthase, complete cds


Rattus norvegicus CDK104 mRNA


zx39b10.s1 Soares_total_fetus_Nb2HF8_9w Homo sapiens cDNA clone IMAGE:788827 3' similar to


Ipomoea purpurea transposable element Tip100 gene for transposase, complete cds


G.lamblia SR2 gene


zd22h10.r1 Soares_fetal_heart_NbHH19W Homo sapiens cDNA clone IMAGE:341443 5'


GAG POLYPROTEIN [CONTAINS: INNER COAT PROTEIN P12; CORE PROTEIN P15; CORE SHELL
PROTEIN P30; NUCLEOPROTEIN P10]


Rattus norvegicus vesicular monoamine transporter type 2, promoter region and exon 1


Feline immunodeficiency virus env gene, isolate ITTO083PIU (M83), partial


Mus musculus serine protease inhibitor 14 (Spi14) mRNA, complete cds


CM1-HT0875-060800-335-e05 HT0875 Homo sapiens cDNA


Corynebacterium glutamicum metK gene, ORF1 (partial) and ORF2 (partial)


Homo sapiens DiGeorge syndrome critical region, telomeric end


we29f05.x1 NCI_CGAP_Lu24 Homo sapiens cDNA clone IMAGE:234252S 3' similar to TR:Q13538 Q13538
ORF2: FUNCTION UNKNOWN. ;


Triticum aestivum (Wcs83) gene, complete cds


RC1-CT0286-230200-016-e03 CT0286 Homo sapiens cDNA


wf11g03.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2350324 3'


Mouse testis-specific protein (TPX-1) gene, exon 10


Homo sapiens matrix metalloproteinase MMP Rasi-1 gene, promoter region


Homo sapiens matrix metalloproteinase MMP Rasi-1 gene, promoter region


Hordeum vulgare receptor-like kinase LRK10 gene, partial cds


Hordeum vulgare receptor-like kinase LRK10 gene, partial cds


Porphyra purpurea chloroplast, complete genome


xg40d10.x1 NCI_CGAP_Ut1 Homo sapiens cDNA clone IMAGE:2630034 3' similar to contains Alu repetitive
element;contains element MSR1 repetitive element ;


Human mRNA for KIAA0124 gene, partial cds


Zea mays cellulose synthase-4 (CesA-4) mRNA, complete cds


on70d04.s1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:1562023 3'


602132442F1 NIH_MGC_81 Homo sapiens cDNA clone IMAGE:4271578 5'


Homo sapiens KIAA0851 gene (partial), XT3 gene and LZTFL1 gene


Homo sapiens KIAA0851 gene (partial), XT3 gene and LZTFL1 gene


Homo sapiens FLI-1 gene, partial


Mesembryanthemum crystallinum putative potassium channel protein Mkt1p mRNA, complete cds


Zaocys dhumnades fructose-1,6-bisphosphatase mRNA, complete cds


Homo sapiens serine palmitoyl transferase, subunit II gene, complete cds; and unknown genes


IMMUNOGLOBULIN A1 PROTEASE PRECURSOR (IGA1 PROTEASE)


Aquifex aeolicus section 12 of 109 of the complete genome


7h23d04.x1 NCI_CGAP_Co16 Homo sapiens cDNA clone IMAGE:3316807 3' similar to SW:PRSB_XENLA
O42586 26S PROTEASE REGULATORY SUBUNIT 6A ;


Mus musculus tulip 1 mRNA, complete cds


602132210F1 NIH_MGC_81 Homo sapiens cDNA clone IMAGE:4271547 5'


Homo sapiens mRNA for KIAA1512 protein, partial cds


7k30b08.x1 NCI_CGAP_Ov18 Homo sapiens cDNA clone IMAGE:3476699 3' similar to SW:GAG_SMSAV
P03330 GAG POLYPROTEIN [CONTAINS: CORE PROTEIN P15; INNER COAT PROTEIN P12; CORE
SHELL PROTEIN P30; NUCLEOPROTEIN P10]. ;


C.familiaris rom1 gene


23S rRNA [Leuconostoc carnosum, Genomic, 2866 nt]


as27e12.x1 Barstead aorta HPLRB6 Homo sapiens cDNA clone IMAGE:2318446 3' similar to gb:X13238
CYTOCHROME C OXIDASE POLYPEPTIDE VIC PRECURSOR (HUMAN);


as27e12.x1 Barstead aorta HPLRB6 Homo sapiens cDNA clone IMAGE:2318446 3' similar to gb:X13238
CYTOCHROME C OXIDASE POLYPEPTIDE VIC PRECURSOR (HUMAN);


Oryctolagus cuniculus cytochrome oxidase subunit VIa (coxVIa2) mRNA, complete cds; nuclear gene for
mitochondrial product


as42f12.x1 Barstead aorta HPLRB6 Homo sapiens cDNA clone IMAGE:2319887 3' similar to contains Alu
repetitive element;


Homo sapiens hypothetical protein FLJ20345 (FLJ20345), mRNA


Glycine max resistance protein LM17 precursor RNA, partial cds


AV719681 GLC Homo sapiens cDNA clone GLCDGB08 5'


AV719681 GLC Homo sapiens cDNA clone GLCDGB08 5'


Mus musculus myosin XV (Myo15), mRNA


601511573F1 NIH_MGC_71 Homo sapiens cDNA clone IMAGE:3912859 5'


za12e08.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:292358 5'


Homo sapiens protocadherin alpha cluster (LOC63960), mRNA


Homo sapiens protocadherin alpha cluster (LOC63960), mRNA


Streptomyces coelicolor A3(2) phosphoenolpyruvate carboxylase (ppc) gene, complete cds


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 58


Oxytricha nova macronuclear telomere-binding protein alpha subunit (tel-alpha alanine version) gene,
complete cds


xc90e06.x1 NCI_CGAP_Brn35 Homo sapiens cDNA clone IMAGE:2591554 3'


EST376533 MAGE resequences, MAGH Homo sapiens cDNA


EST84061 Rhabdomyosarcoma Homo sapiens cDNA 5' end similar to DnaJ homolog (GB:X63368)


Homo sapiens DNA polymerase epsilon catalytic subunit protein (POLE1) gene, exon 1a


Mus musculus Wrn protein (Wrn) gene, complete cds


AU133116 NT2RP4 Homo sapiens cDNA clone NT2RP4001328 5'


Chlamydia trachomatis section 26 of 87 of the complete genome


xf14=08.x1 NCI_CGAP_Kid8 Homo sapiens cDNA clone IMAGE:2618030 3' similar to gb:X03
SYNTHASE BETA CHAIN, MITOCHONDRIAL PRECURSOR (HUMAN);


yg09a12.s1 Soares infant brain 1NIB Homo sapiens cDNA clone IMAGE:31663 3' similar to co
repetitive element ;


P.sativum PS-IAA4/5 gene


Homo sapiens tubby like protein 1 (TULP1) gene, exons 9-11


Homo sapiens tubby like protein 1 (TULP1) gene, exons 9-11


Drosophila melanogaster testis-specific RNA-binding protein (bruno) mRNA, complete cds


Staphylococcus aureus toxic shock syndrome toxin-1 (tst), enterotoxin (ent), and integrase (int)
complete cds


Arabidopsis thaliana serine/threonine protein phosphatase type one (TOPP8) gene, complete c


Zea mays starch branching enzyme 1 (sbe1) gene, complete cds


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 57


Homo sapiens mRNA for KIAA1198 protein, partial cds


Marsupial cat beta-globin gene mRNA, partial cds


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 15


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 15


Homo sapiens calcium channel alpha1E subunit (CACNA1E) gene, exons 7-49, and partial cds


Rattus norvegicus sodium channel 1 mRNA, complete cds


Homo sapiens partial 5-HT4 receptor gene, exons 2 to 5


Influenza A/Guangdong/243/72 nucleoprotein (seg 5) gene, 5' end


yx38h08.r1 Soares melanocyte 2NbHM Homo sapiens cDNA clone IMAGE:264063 5'


Homo sapiens KIAA0173 gene product (KIAA0173), mRNA


nh02a05.s1 NCI_CGAP_Thy1 Homo sapiens cDNA clone IMAGE:943088 similar to contains L1.t3


S.commune orotidine-5'-phosphate decarboxylase (URA1) gene, complete cds


yh45h10.r1 Soares placenta Nb2HP Homo sapiens cDNA clone IMAGE:133027 5'


E.dispar mRNA for hexokinase (hxk1)


601274604F1 NIH_MGC_20 Homo sapiens cDNA clone IMAGE:3615768 5'


P.dumerilii histone gene cluster for core histones H2A, H2B, H3 and H4


Brugia pahangi microfilarial sheath protein SHP3 (shp3) gene, complete cds


ys02g06.s1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:2


Rickettsia prowazekii strain Madrid E, complete genome; segment 1/4


600944087T1 NIH_MGC_17 Homo sapiens cDNA clone IMAGE:2960248 3'


Mesocricetus auratus oviductin precursor (OVI) gene, complete cds


Escherichia coli O157:H7 genomic DNA, Sakai-VT2 prophage inserted region


601569022F1 NIH_MGC_21 Homo sapiens cDNA clone IMAGE:38439S4 5'


PROBABLE PROCESSING AND TRANSPORT PROTEIN LJLSB (HFLFO PROT


COLLAGEN ALPHA 3(IV) CHAIN PRECURSOR


QV3-BN0047-020800-284-d08 BN0047 Homo sapiens cDNA


QV3-BN0047-020800-284-d08 BN0047 Homo sapiens cDNA


Botrytis cinerea strain T4 cDNA library under conditions of nitrogen deprivation


Homo sapiens homogentisate 1,2-dioxygenase gene, complete cds


Pseudomonas putida long-chain-fatty-acid-CoA ligase (fadD) gene, complete cds


Homo sapiens cleavage and polyadenylation specificity factor 3, 73kD subunit (CP


Homo sapiens cleavage and polyadenylation specificity factor 3, 73kD subunit (CP


RC2-BN0032-120200-011-a10 BN0032 Homo sapiens cDNA


Homo sapiens neuroligin 3 isoform gene, complete cds, alternatively spliced 


Homo sapiens neuroligin 3 isoform gene, complete cds, alternatively spliced 


Bacillus halodurans genomic DNA, section 2/14


EST389564 MAGE resequences, MAGO Homo sapiens cDNA 


Human class IV alcohol dehydrogenase (ADH7) gene, exon 3


Drosophila melanogaster mRNA for serine protease inhibitor (serpin-5), (sp5 gene)


Homo sapiens chromosome 21 segment HS21C084 


Homo sapiens solute carrier family 7 (cationic amino acid transporter, y+ system), i 
mRNA 


Top Hit 
Database 


EST_HUMAN


EST_HUMAN


27203


Top Hit Accession
No.


AJ23G270.1


AL114656


AF217413.1


Most Similar
(Top) Hit
BLAST E


.7E-01


Expression 










ORFSEQ 
ID NO: 


34583


35131


37190


SEQ ID
NO:


23765


8853



rlington/93/UK 


































Homo sapiens transcription factor IGHM enhancer 3, JM11 protein, JM4 protein, JM5 protein,
JM10 protein, A4 differentiation-dependent protein, triple LIM domain protein 6, and synaptophy
complete cds; and L-type calcium channel a












HU/NLV/Gi 


HU/NLV/Gi 














-B 










AV658033 GLC Homo sapiens cDNA clone GLCFIB12 3'


Carassius auratus activin beta A precursor, mRNA, complete cds


yh35f12.r1 Soares placenta Nb2HP Homo sapiens cDNA clone IMAGE:131759 5' similar to contains Alu
repetitive element;contains TAR1 repetitive element ;


Z.mobilis tgt and lig genes encoding tRNA guanine transglycosylase and DNA ligase


Z.mobilis tgt and lig genes encoding tRNA guanine transglycosylase and DNA ligase


SKIN SECRETORY PROTEIN XP2 PRECURSOR (APEG PROTEIN)


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 23


zpS3b12.r1 Stratagene muscle 937209 Homo sapiens cDNA clone IMAGE:627743 5'


RC2-NT0112-120600-014-f03 NT0112 Homo sapiens cDNA


601680551R2 NIH_MGC_83 Homo sapiens cDNA clone IMAGE:3950604 3'


DEOXYRIBONUCLEASE II PRECURSOR (DNASE II) (ACID DNASE) (LYSOSOMAL DNASE II)


ws08d01.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2486577 3' similar to contains MER7.t3
MER7 repetitive element ;


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 16


UI-H-BI3-alc-d-07-3-UI.s1 NCI_CGAP_Sub5 Homo sapiens cDNA clone IMAGE:2736420 3'


601456301F1 NIH_MGC_66 Homo sapiens cDNA clone IMAGE:3859849 5'


601906489F1 NIH_MGC_54 Homo sapiens cDNA clone IMAGE:4134071 5'


QV2-NT0048-160800-316-e05 NT0048 Homo sapiens cDNA


Chlamydophila pneumoniae AR39, section 91 of 94 of the complete genome


an32c04.y5 Gessler Wilms tumor Homo sapiens cDNA clone IMAGE:1700358 5'


Proteus mirabilis fimbrial operon, strain HI4320


EST378303 MAGE resequences, MAGI Homo sapiens cDNA


RC5-BT0254-031099-011-a03 BT0254 Homo sapiens cDNA


601498088F1 NIH_MGC_70 Homo sapiens cDNA clone IMAGE:3900165


Mus musculus lymphocyte antigen 78 (Ly78), mRNA


AU137084 PLACE1 Homo sapiens cDNA clone PLACE1005740 5'


Triticum aestivum heat shock protein 101 (Hsp101a) mRNA, complete cds


Human BRCA1, Rho7 and vatI genes, complete cds, and ipf35 gene, partial cds


Acinetobacter sp. cysD, cobQ, sodM, lysS, rubA, rubB, estB, oxyR, ppk, mtgA, ORF2 and ORF3 genes


Human BRCA1, Rho7 and vatI genes, complete cds, and ipf35 gene, partial cds


Rattus norvegicus calcium channel alpha-1C subunit (ROB2) mRNA, partial cds


Human pephBGT-1 betaine-GABA transporter mRNA, complete cds


Homo sapiens BAI1-associated protein 3 (BAIAP3) mRNA


Homo sapiens nasopharyngeal epithelium specific protein 1 (NESG1), mRNA


Rattus norvegicus synaptic vesicle protein 2C (SV2C) mRNA, complete cds


Mouse germline IgM chain gene, D region; D-q52, mu switch region (part a)


Mouse germline IgM chain gene, D region; D-q52, mu switch region (part a)


INSULIN RECEPTOR-RELATED PROTEIN PRECURSOR (IRR) (IR-RELATED RECEPTOR)


Homo sapiens Snf2-related CBP activator protein (SRCAP) mRNA


Homo sapiens Snf2-related CBP activator protein (SRCAP) mRNA


Dictyostelium discoideum proteasome subunit C2 homolog PrtC (prtC) gene, complete cds


Homo sapiens 14q32 Jagged2 gene, complete cds; and unknown gene


hi20c08.x1 NCI_CGAP_GU1 Homo sapiens cDNA clone IMAGE:2972848 3'


Rattus norvegicus SPA-1 like protein p1294 mRNA, complete cds


Lacerta media cytochrome c oxidase subunit 1 gene, partial cds; mitochondrial gene for mitochondrial product


Lacerta media cytochrome c oxidase subunit 1 gene, partial cds; mitochondrial gene for mitochondrial product


601893437F1 NIH_MGC_17 Homo sapiens cDNA clone IMAGE:4139216 5'


Archaeoglobus fulgidus section 34 of 172 of the complete genome


Bacillus stearothermophilus BsrFI methylase (FIM) and BsrFI restriction endonuclease (FIR) genes, complete


Helicobacter pylori 26695 section 130 of 134 of the complete genome


oq83b07.s1 NCI_CGAP_Kid5 Homo sapiens cDNA clone IMAGE:1592917 3' similar to gb:K01144 HLA
CLASS II HISTOCOMPATIBILITY ANTIGEN, GAMMA CHAIN PRECURSOR (HUMAN);


M PROTEIN, SEROTYPE 6 PRECURSOR


Mus musculus phospholipase C-like protein mRNA, partial cds


Mus musculus myosin XV (Myo15), mRNA


RC4-OT0037-200700-014-e05 OT0037 Homo sapiens cDNA


RC4-OT0037-200700-014-e05 OT0037 Homo sapiens cDNA


V.ammodytes gene for ammodytoxin C


Homo sapiens chromosome 22 open reading frame 5 (C22ORF5), mRNA


Homo sapiens heparanase precursor, mRNA, complete cds


Streptococcus mutans gene for glucose-1-phosphate uridylyltransferase, complete cds


Antirrhinum majus mRNA for MYB-related transcription factor


QV3-BN0046-150400-151-e04 BN0046 Homo sapiens cDNA


Homo sapiens solute carrier family 6 (neurotransmitter transporter, glycine), member 9 (SLC6A9), mRNA


Homo sapiens solute carrier family 6 (neurotransmitter transporter, glycine), member 9 (SLC6A9), mRNA


wq24h09.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2472257 3'


wl52b02.x1 NCI_CGAP_Brn25 Homo sapiens cDNA clone IMAGE:2428491 3' similar to gb:M14328 ALPHA
ENOLASE (HUMAN);


AU11B913 HEMBA1 Homo sapiens cDNA clone HEMBA1000264 5'


7c61c05.x1 NCI_CGAP_Pr28 Homo sapiens cDNA clone IMAGE:3578504 3' similar to contains element
MER27 repetitive element ;


C.fimi DSM 20113 16S rDNA


RC5-LT0054-260100-011-H09 LT0054 Homo sapiens cDNA


Mus musculus paired-like homeodomain transcription factor 1 (Pitx1), mRNA


wf43h01.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2358385 3'


Homo sapiens ADP/ATP carrier protein (ANT-2) gene, complete cds


Mus musculus ubiquitin c-terminal hydrolase related polypeptide (Uchrp), mRNA


Caenorhabditis elegans mRNA for DYS-1 protein, partial
Top Hit Descriptor


601453313F1 NIH_MGC_66 Homo sapiens cDNA clone IMAGE:3857738 5'


601658738R1 NIH_MGC_69 Homo sapiens cDNA clone IMAGE:3886209 3'


Thermotoga maritima section 101 of 136 of the complete genome


CM0-NN1004-130300-284-g08 NN1004 Homo sapiens cDNA


Homo sapiens chromosome 21 segment HS21C102


Homo sapiens NGB gene for neuroglobin, exons 1-4


zj24a02.s1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA clone IMAGE:451178 3' similar to
gb:L02426 26S PROTEASE SUBUNIT 4 (HUMAN);


PROLINE-RICH PROTEIN MP-3


PROLINE-RICH PROTEIN MP-3


601896047F1 NIH_MGC_19 Homo sapiens cDNA clone IMAGE:4125515 5'


Homo sapiens KIAA0424 protein (KIAA0424), mRNA


Homo sapiens mRNA for KIAA0518 protein, partial cds


zj24a02.s1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA clone IMAGE:451178 3' similar to
gb:L02426 26S PROTEASE SUBUNIT 4 (HUMAN);


Methanobacterium thermoautotrophicum from bases 1029155 to 1039934 (section 88 of 148) of the complete
genome


Methanobacterium thermoautotrophicum from bases 1029155 to 1039934 (section 88 of 148) of the complete


Homo sapiens chromosome 21 segment HS21C101


Homo sapiens chromosome 21 segment HS21C101


Methanococcus jannaschii section 73 of 150 of the complete genome


CALMODULIN
Top Hit Descriptor


Streptococcus pneumoniae putative response regulator (zmpR), putative histidine kinase (zmpS)
zinc metalloprotease (zmpB) genes, complete cds


PROLINE-RICH PROTEIN MP-3


PROLINE-RICH PROTEIN MP-3


Lactococcus lactis cspE gene


Human gene for sex hormone-binding globulin (SHBG)


AV712452 DCA Homo sapiens cDNA clone DCAAUG01 5'


Homo sapiens plasma membrane calcium ATPase isoform 1 (ATP2B1) gene, alternative splice f


601763523F1 NIH_MGC_20 Homo sapiens cDNA clone IMAGE:4026436 5'


hq24f11.x1 NCI_CGAP_Adr1 Homo sapiens cDNA clone IMAGE:3120333 3' similar to TR:Q9Z:
ATYPICAL PKC SPECIFIC BINDING PROTEIN ;


Homo sapiens zinc finger protein 92 (ZFP92), expressed-Xq28STS protein (XQ28ORF), and big
genes, complete cds; and plasma membrane calcium ATPase isoform 3 (PMCA3) gene, partial c


601343926F1 NIH_MGC_53 Homo sapiens cDNA clone IMAGE:3685951 5'


601055194F1 NIH_MGC_10 Homo sapiens cDNA clone IMAGE:3451559 5'


Rattus norvegicus bHLH transcription factor Mist1 (Mist1) gene, complete cds


af81a04.r1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:1048398 5'


no05h08.s1 NCI_CGAP_Phe1 Homo sapiens cDNA clone IMAGE:1099839 3'


Homo sapiens ataxia telangiectasia (ATM) gene, complete cds


Human immunodeficiency virus type 1 (D9) proviral structural capsid protein (gag) gene, partial c


601872281F1 NIH_MGC_53 Homo sapiens cDNA clone IMAGE:4092981 5'


qd92a10.x1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:1736922 3'


601143974F1 NIH_MGC_15 Homo sapiens cDNA clone IMAGE:3051234 5'


COLLAGEN ALPHA 1(XVI) CHAIN PRECURSOR


UI-H-BI1-acy-c-07-0-UI.s1 NCI_CGAP_Sub3 Homo sapiens cDNA clone IMAGE:2716020 3'
Top Hit Descriptor


qh41d01.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:1847233 3'


Penicillium urticae mitochondrial 1-rRNA (large rRNA) gene and its flanking region


Homo sapiens chemokine receptor CXCR4 gene, promoter region and complete cds


Dictyostelium discoideum darlin (darA) gene, complete cds


Human respiratory syncytial virus, complete genome


Human respiratory syncytial virus, complete genome


Homo sapiens EWS, gar22, rrp22 and bam22 genes


Homo sapiens vinculin (VCL), mRNA


MR1-SN0064-010600-C06-a12 SN0064 Homo sapiens cDNA


Mus musculus DIPB gene (Dipb), mRNA


601571046F1 NIH_MGC_20 Homo sapiens cDNA clone IMAGE:3954178 5'


Homo sapiens E2F-like protein (LOC51270), mRNA


Xenopus laevis alpha(E)-catenin mRNA, complete cds


Aquifex aeolicus section 96 of 109 of the complete genome


zv46h12.s1 Soares_ovary_tumor_NbHOT Homo sapiens cDNA clone IMAGE:756743 3' similar to gb:M26038
HLA CLASS II HISTOCOMPATIBILITY ANTIGEN, DR-5 BETA CHAIN (HUMAN);


Azotobacter vinelandii ATCC 9046 negative regulator MucB (mucB) gene, partial cds


601S56817R1 NIH_MGC_67 Homo sapiens cDNA clone IMAGE:3865637 3'


601823511F1 NIH_MGC_77 Homo sapiens cDNA clone IMAGE:4043138 5'


zr32g05.s1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:665144 3'


Nectria haematococca kinesin related protein 2 (KRP2) gene, complete cds


Mus musculus IFN-response element binding factor 1 (IREBF-1), mRNA


Heterodera glycines beta-1,4-endoglucanase-1 precursor (HG-eng-1) gene, complete cds


Heterodera glycines beta-1,4-endoglucanase-1 precursor (HG-eng-1) gene, complete cds


Neisseria meningitidis serogroup A strain Z2491 complete genome; segment 6/7


Mus musculus chaperonin subunit 6a (zeta) (Cct6a), mRNA


k141S.seq.F Human fetal heart, Lambda ZAP Express Homo sapiens cDNA 5'


AF150195 Human mRNA from cd34+ stem cells Homo sapiens cDNA clone CBDAIA10


RC1-OT0083-150S00-014-g06 OT0083 Homo sapiens cDNA


Homo sapiens mRNA for KIAA0554 protein, partial cds


Homo sapiens DNA topoisomerase II beta (TOP2B) gene, exons 16, 17, and 18


Homo sapiens DNA topoisomerase II beta (TOP2B) gene, exons 16, 17, and 18


Human hereditary haemochromatosis region, histone 2A-like protein gene, hereditary haemochromatosis
(HLA-H) gene, RoRet gene, and sodium phosphate transporter (NPT3) gene, complete cds


Human hereditary haemochromatosis region, histone 2A-like protein gene, hereditary haemochromatosis
(HLA-H) gene, RoRet gene, and sodium phosphate transporter (NPT3) gene, complete cds


Drosophila melanogaster mRNA for mod(mdg4)51.4 protein


Mus musculus major histocompatibility locus class III regions Hsc70t gene, partial cds; smRNP, G7A, NG23,
MutS homolog, CLCP, NG24, NG25, and NG26 genes, complete cds; and unknown genes


HEAT SHOCK PROTEIN 70 HOMOLOG


wr66g10.x1 NCI_CGAP_Ut1 Homo sapiens cDNA clone IMAGE:2492706 3'


Synechocystis sp. PCC6803 complete genome, 14/27, 1719B44-1848241


Drosophila melanogaster Domina gene, exons 1-3


AV698070 GKC Homo sapiens cDNA clone GKCAHE01 5'


601S73316F1 NIH_MGC_54 Homo sapiens cDNA clone IMAGE:4097499 5'
Top Hit Descriptor


EST84266 Colon adenocarcinoma IV Homo sapiens cDNA 5' end similar to tissue-spet


Streptococcus pneumoniae parC, parE and transposase genes and ORF DNA


Rattus norvegicus testis specific protein mRNA, complete cds


RC3-BT0253-011199-013-b04 BT0253 Homo sapiens cDNA


wf48h05.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2358873 3' s
L1.t1 L1 L1 repetitive element;


Homo sapiens stimulated trans-acting factor (50 kDa) (STAF50) mRNA


Homo sapiens stimulated trans-acting factor (50 kDa) (STAF50) mRNA


601815274F2 NIH_MGC_56 Homo sapiens cDNA clone IMAGE:4049226 5'


601874710F1 NIH_MGC_54 Homo sapiens cDNA clone IMAGE:4101074 5'


qf5Sb08.x1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:1754199 3'


ts78a06.x1 NCI_CGAP_GC6 Homo sapiens cDNA clone IMAGE:2237362 3'


Acipenser baeri partial IGLV gene for immunoglobulin light chain variable region, exons


EST180654 Jurkat T-cells V Homo sapiens cDNA 5' end similar to similar to heat shock
like


EST180654 Jurkat T-cells V Homo sapiens cDNA 5' end similar to similar to heat shock


zn87c08.r1 Stratagene lung carcinoma 937218 Homo sapiens cDNA clone IMAGE:56
gb:X69181 60S RIBOSOMAL PROTEIN L31 (HUMAN);


wf69h03.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2360885 3' s
O60298 KIAA0551 PROTEIN ;


RC1-DT0001-290'00-012-e10 DT0001 Homo sapiens cDNA


Mus musculus p53 tumor suppressor gene, exon 10 and 11, partial cds; alternatively sp


Drosophila melanogaster LD23107 sting (sting) mRNA, complete cds


Mus musculus iroquois related homeobox 5 (Drosophila) (Irx5), mRNA
Top Hit Descriptor


Homo sapiens ES18 mRNA, partial cds


Cucumis melo polygalacturonase precursor (MPG3) mRNA, complete cds


Mus musculus fatty acid amide hydrolase gene, exon 10


Bacillus subtilis complete genome (section 1 of 21): from 1 to 213080


SALIVARY ACIDIC PROLINE-RICH PHOSPHOPROTEIN 1/2 PRECURSOR (PRP-1/PRP-3) (PRP-2/PRP-
4) (PIF-F/PIF-S) (PROTEIN A/PROTEIN C) [CONTAINS: PEPTIDE P-C]


Oryctolagus cuniculus UDP-glucuronosyltransferase (UGT2B13) mRNA, complete cds


Mus musculus Unc-51 like kinase 2 (C. elegans) (Ulk2), mRNA


Haemophilus influenzae Rd section 97 of 163 of the complete genome


Antheraea pernyi period clock protein homolog mRNA, complete cds


CASEIN KINASE II BETA CHAIN (CK II)


Homo sapiens ubiquitous tetratricopeptide containing protein RoXaN mRNA, partial cds


Gallus gallus tyrosine kinase JAK1 (JAK1) mRNA, complete cds


Mus musculus Dmp-1 gene, exons 1-6


NEUROFILAMENT TRIPLET L PROTEIN (NEUROFILAMENT LIGHT POLYPEPTIDE) (NF-L)


NEUROFILAMENT TRIPLET L PROTEIN (NEUROFILAMENT LIGHT POLYPEPTIDE) (NF-L)


MR0-CT0064-1C0899-002-g13 CT0064 Homo sapiens cDNA


Mus musculus Fas-interacting serine/threonine kinase 3 (Fist3) mRNA, complete cds


Methanococcus jannaschii section 142 of 150 of the complete genome


Chicken 28-kDa vitamin D-dependent calcium-binding protein (CaBP-28) mRNA, complete cds


Homo sapiens ABCA1 (ABCA1) gene, complete cds


Homo sapiens ABCA1 (ABCA1) gene, complete cds


Zea mays phytoene synthase (Y1) gene, complete cds


ATROPHIN-1 (DENTATORUBRAL-PALLIDOLUYSIAN ATROPHY PROTEIN)


zq48a12.s1 Stratagene hNT neuron (#937233) Homo sapiens cDNA clone IMAGE:632326 3' similar to
contains Alu repetitive element;contains element MSR1 repetitive element ;


zt78a03.s1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:728428 3'


zt78a03.s1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:728428 3'


Danio rerio de novo DNA methyltransferase 3 (dnmt3) mRNA, partial cds


Danio rerio de novo DNA methyltransferase 3 (dnmt3) mRNA, partial cds


Drosophila melanogaster developmental protein (rough) gene, complete cds


xg56g10.x1 NCI_CGAP_Ut4 Homo sapiens cDNA clone IMAGE:2632386 3'


xg56g10.x1 NCI_CGAP_Ut4 Homo sapiens cDNA clone IMAGE:2632386 3'
Top Hit Descriptor


neB7f04.s1 NCI_CGAP_Kid1 Homo sapiens cDNA clone IMAGE:911263


yh63d04.s1 Soares placenta Nb2HP Homo sapiens cDNA clone IMAGE:134407 3'


QV4-NN0038-270400-187-h05 NN0038 Homo sapiens cDNA


Rattus norvegicus UDP-Gal:glucosylceramide beta-1,4-galactosyltransferase mRNA, complete cds


601338428F1 NIH_MGC_53 Homo sapiens cDNA clone IMAGE:3680695 5'


601338428F1 NIH_MGC_53 Homo sapiens cDNA clone IMAGE:3680695 5'


yu07e10.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:233130 5'


MULTIDRUG RESISTANCE-ASSOCIATED PROTEIN 5 (ABC TRANSPORTER MOAT-C) (PABC11)
(SMRP)


S.vulgare pepC gene for PEP carboxylase


S.vulgare pepC gene for PEP carboxylase


yf25c09.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:127888 5'


Sus scrofa deoxyribonuclease II mRNA, complete cds


601452661F1 NIH_MGC_66 Homo sapiens cDNA clone IMAGE:3856593 5'


Neisseria meningitidis DNA for region 2 (fhaB- and fhaC-homologs, unknown genes) and flanking genes,


601140729F1 NIH_MGC_9 Homo sapiens cDNA clone IMAGE:3049830 5'


HUMNK262 Human epidermal keratinocyte Homo sapiens cDNA clone 262


Buchnera aphidicola natural-host Schlechtendalia chinensis gluconate-6-phosphate dehydrogenase (gnd)
gene, partial cds


Buchnera aphidicola natural-host Schlechtendalia chinensis gluconate-6-phosphate dehydrogenase (gnd)
gene, partial cds


EST383706 MAGE resequences, MAGN Homo sapiens cDNA


Aeropyrum pernix genomic DNA, section 7/7


Sheep gene for ultra high-sulphur keratin protein


AU135S17 PLACE1 Homo sapiens cDNA clone PLACE1002962 5'


EST382234 MAGE resequences, MAGK Homo sapiens cDNA


Homo sapiens retinal fascin (FSCN2) gene, exon 2


Homo sapiens retinal fascin (FSCN2) gene, exon 2


601594078F1 NIH_MGC_9 Homo sapiens cDNA clone IMAGE:3948067 5'


yd21b08.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:108855 5'
Top Hit Descriptor


HEMOCYTIN PRECURSOR (HUMORAL LECTIN)


HEMOCYTIN PRECURSOR (HUMORAL LECTIN)


Human retrotransposon 3' long terminal repeat


yu12c05.s1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:233576 3' similar to contains
Alu repetitive element;contains A3R repetitive element ;


za35g11.s1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:294596 3' similar to
gb|KQ2909|RATSR7K Rat (rRNA);contains A3R.b1 A3R repetitive element ;


Borrelia burgdorferi (section 11 of 70) of the complete genome


zu91c06.s1 Soares testis NHT Homo sapiens cDNA clone IMAGE:745354 3' similar to gb:J04422 ISLET
AMYLOID POLYPEPTIDE PRECURSOR (HUMAN);contains Alu repetitive element;contains element XTR


nh07b12.s1 NCI_CGAP_Thy1 Homo sapiens cDNA clone IMAGE:943583 similar to contains Alu repetitive
element;contains element PTR5 repetitive element ;


Mus musculus major histocompatibility locus class III regions Hsc70t gene, partial cds; smRNP, G7A, NG23,
MutS homolog, CLCP, NG24, NG25, and NG26 genes, complete cds; and unknown genes


Mus musculus major histocompatibility locus class III regions Hsc70t gene, partial cds; smRNP, G7A, NG23,
MutS homolog, CLCP, NG24, NG25, and NG26 genes, complete cds; and unknown genes


Bacteriophage blLQ7, complete genome


Mus musculus DinB homolog 1 (E. coli) (Dinb1), mRNA


Rattus norvegicus cAMP-regulated guanine nucleotide exchange factor 1 (cAMP-GEFI) mRNA, complete cds


Rattus norvegicus cAMP-regulated guanine nucleotide exchange factor 1 (cAMP-GEFI) mRNA, complete cds


Caenorhabditis elegans mRNA for iron-sulfur subunit of mitochondrial succinate dehydrogenase, complete


Homo sapiens mammary tumor-associated protein INT6 (INT6) gene, exon 4


HSAAACADH P, Human foetal Brain Whole tissue Homo sapiens cDNA


Canis beta-galactosides-binding lectin (LGALS3) mRNA, 3' end
Top Hit Descriptor


{microsatellite INRA41} [Ovis aries=sheep, Genomic, 361 nt, segment 1 of 2]


Homo sapiens putative Rab5 GDP/GTP exchange factor homologue (RABEX5), mRNA


qb22a08.x1 Soares_pregnant_uterus_NbHPU Homo sapiens cDNA clone IMAGE:1696982 3'


hm45a04.x1 NCI_CGAP_RDF1 Homo sapiens cDNA clone IMAGE:3015534 3' similar to contain
MER19.b1 MER19 repetitive element ;


HISTIDINE-RICH GLYCOPROTEIN PRECURSOR


ac19f04.s1 Stratagene ovary (#937217) Homo sapiens cDNA clone IMAGE:856927 3' similar to
repetitive element;contains element MER24 repetitive element ;


Ve86f08.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:124647 5'


qmu8gC7.x1 NCI_CGAP_Lu5 Homo sapiens cDNA clone IMAGE:1881276 3' similar to gb:X523
FINGER PROTEIN 30 (HUMAN);


hf34a03.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2933740 3' similar to c
L1.t1 L1 repetitive element ;


Messenger RNA for anglerfish (Lophius americanus) somatostatin II


Rattus norvegicus N-arginine dibasic convertase 1 (Nrd1), mRNA


wg39f09.x1 Soares_NSF_F8_9W_OT_PA_P_S1 Homo sapiens cDNA clone IMAGE:2367113
contains Alu repetitive element;


T.niveum (ATCC34921) simA gene for cyclosporine synthetase


Macaca fascicularis protein tyrosine phosphatase (PRL-1) mRNA, complete cds


Homo sapiens nebulin (NEB), mRNA


Human apolipoprotein (a) gene, exon 1


Human apolipoprotein (a) gene, exon 1


Homo sapiens hyperion gene, exons 1-50


Caenorhabditis elegans cCAF1 protein gene, complete cds


DKFZp434I0314_r1 434 (synonym: htes3) Homo sapiens cDNA clone DKFZp434I0314 5'


Homo sapiens serum constituent protein (MSE55), mRNA


CM4-NN1030-0404DO-iao-f03 NN1030 Homo sapiens cDNA


oe08d04.s1 NCI_CGAP_Ov2 Homo sapiens cDNA clone IMAGE:1385287 similar to contains elf
repetitive element ;


Mycobacterium tuberculosis H37Rv complete genome; segment 13/162


Treponema maltophilum flaB2, flaB3 and fliD genes for flagellin subunit proteins and CAP protein
Top Hit Descriptor


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 82


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 82


Homo sapiens coagulation factor XII (Hageman factor) (F12), mRNA


Mus musculus histocompatibility 2, complement component factor B (H2-Bf


EST374761 MAGE resequences, MAGG Homo sapiens cDNA


EST374761 MAGE resequences, MAGG Homo sapiens cDNA


Homo sapiens hypothetical protein FLJ10379 (FLJ10379), mRNA


Homo sapiens hypothetical protein FLJ10379 (FLJ10379), mRNA


601567403F1 NIH_MGC_21 Homo sapiens cDNA clone IMAGE:3842280


601567403F1 NIH_MGC_21 Homo sapiens cDNA clone IMAGE:3842280


H. sapiens La/SS-B pseudogene 3


ni11c04.s1 NCI_CGAP_Br2 Homo sapiens cDNA clone IMAGE:1029990 3'


Mycobacterium tuberculosis H37Rv complete genome; segment 88/162


Candida boidinii methanol oxidase (AOD1) gene, complete cds


Homo sapiens SPP2 gene for secreted phosphoprotein 24 precursor, exons


601078239F1 NIH_MGC_12 Homo sapiens cDNA clone IMAGE:3464241


Homo sapiens chromosome 21 segment HS21C018


Z.mays Knotted-1 (Kn-1) gene


Human IFNAR gene for interferon alpha/beta receptor


Arabidopsis thaliana F21J9.2 mRNA, complete cds


Homo sapiens sperm associated antigen 7 (SPAG7), mRNA


Homo sapiens chromosome 21 segment HS21C001


602129475F1 NIH_MGC_56 Homo sapiens cDNA clone IMAGE:4286203


602129475F1 NIH_MGC_56 Homo sapiens cDNA clone IMAGE:4286203


Human germline T-cell receptor beta chain TCRBV17S1A1T, TCRBV2S1,
TCRBV19S1P, TCRBV15S1, TCRBV11S1A1T, HVB relic, TCRBV28S1P,
TCRBV3S1, TCRBV4S1A1T, TRY4, TRY5, TRY6, TRY7, TRY8, TCRBD1


Rice gene for thioredoxin h, complete cds
Top Hit Descriptor


Z.mays U3snRNA pseudogene


601459570F1 NIH_MGC_66 Homo sapiens cDNA clone IMAGE:3863177 5'


601459570F1 NIH_MGC_66 Homo sapiens cDNA clone IMAGE:3863177 5'


Crithidia fasciculata 27 kDa guide RNA-binding protein mRNA, complete cds; mitochondrial gene for
mitochondrial product


tg55h07.x1 NCI_CGAP_Pr28 Homo sapiens cDNA clone IMAGE:2112733 3' similar to gb:X15183_cds1
HEAT SHOCK PROTEIN HSP 90-ALPHA (HUMAN);contains Alu repetitive element;contains element MER5
repetitive element ;


AV75001S MDS Homo sapiens cDNA clone MDSBDC10 5'


SPLICEOSOME ASSOCIATED PROTEIN 62 (SAP 62) (SPLICING FACTOR 3A SUBUNIT 2) (SF3A66)


RC2-DT0007-120200-016-h02 DT0007 Homo sapiens cDNA


Homo sapiens renal dipeptidase (RDP) gene, complete cds


H.sapiens gene for Me491/CD63 antigen


wh42f09.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2383433 3' similar to contains element
MER22 MER22 repetitive element ;


Arabidopsis thaliana DNA chromosome 4, contig fragment No. 59


Mus musculus MHC class III protein RP1 (Rp1) mRNA, partial cds
Top Hit Descriptor


Human pro-alpha1 type II collagen (COL2A1) gene exons 1-54, complete cds


zx75a03.s1 Soares ovary tumor NbHOT Homo sapiens cDNA clone IMAGE:809548 3' similar to
SW:DXA2_MOUSE P14685 PROBABLE DIPHENOL OXIDASE A2 COMPONENT ;


602077774F1 NIH_MGC_S2 Homo sapiens cDNA clone IMAGE:4252002 5'


RCC-UM0014-170400-023-G01 UM0014 Homo sapiens cDNA


Homo sapiens X28 region near ALD locus containing dual specificity phosphatase 9 (DUSP9), ribosomal
protein L18a (RPL18a), Ca2+/Calmodulin-dependent protein kinase I (CAMKI), creatine transporter (CRTR),
CDM protein (CDM), adrenoleukodystrophy protein >


Homo sapiens X28 region near ALD locus containing dual specificity phosphatase 9 (DUSP9), ribosomal
protein L18a (RPL18a), Ca2+/Calmodulin-dependent protein kinase I (CAMKI), creatine transporter (CRTR),
CDM protein (CDM), adrenoleukodystrophy protein >


UI-HF-BN0-alp-g-04-0-UI.r1 NIH_MGC_50 Homo sapiens cDNA clone IMAGE:3C


7q74c09.x1 NCI_CGAP_Lu24 Homo sapiens cDNA clone IMAGE: 3' similar to co
element;contains element MER31 repetitive element ;


hh02c07.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2953932 3' sim
LTR5 repetitive element ;


RC3-ST0281-2404CO-015-f03 ST0281 Homo sapiens cDNA


Homo sapiens protein kinase CK2 catalytic subunit alpha gene, exon 1


Homo sapiens protein kinase CK2 catalytic subunit alpha gene, exon 1
Top Hit Descriptor


qm99d11.x1 NCI_CGAP_Lu5 Homo sapiens cDNA clone IMAGE:1896885 3'


yd77g10.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:114306 5'


PROTEOGLYCAN LINK PROTEIN PRECURSOR (CARTILAGE LINK PROTEIN) (LP)


hf37b06.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2934035 3' similar to TR:Q60976
Q60976 JERKY. ;


yx42g06.s1 Soares melanocyte 2NbHM Homo sapiens cDNA clone IMAGE:264442 3' similar to contains
L1.b2 L1 repetitive element ;


yx42g06.s1 Soares melanocyte 2NbHM Homo sapiens cDNA clone IMAGE:264442 3' similar to contains
L1.b2 L1 repetitive element ;


Homo sapiens ASCL3 gene, CEGP1 gene, C11orf14 gene, C11orf15 gene, C11orf16 gene and C11orf17
gene


MR2-UM0025-300300-102-f02 UM0025 Homo sapiens cDNA


MR2-UM0025-300300-102-f02 UM0025 Homo sapiens cDNA


Homo sapiens mannosidase, beta A, lysosomal (MANBA) gene, and ubiquitin-conjugating enzyme E2D 3
(UBE2D3) genes, complete cds


yp86a09.s1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:194296 3'


225) (TENASCIN-C) (TN-C)


BETA-GALACTOSIDASE PRECURSOR (LACTASE)


BETA-GALACTOSIDASE PRECURSOR (LACTASE)


Homo sapiens caspase recruitment domain-containing protein (BCL10) gene, complete cds
Top Hit Descriptor


Caenorhabditis elegans spliced leader RNA (SL3 alpha), (SL4), and (SL5) genes


ov45c04.x1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:1640262 3'


ov45c04.x1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:1640262 3'


APOLIPOPROTEIN A-IV PRECURSOR (APO-AIV)


zs44f01.r1 NCI_CGAP_GCB1 Homo sapiens cDNA clone IMAGE:700345 5'


Homo sapiens KVLQT1 gene


Epstein-Barr virus (AG876 isolate) U2-IR2 domain encoding nuclear protein EBNA2, complete cds


Epstein-Barr virus (AG876 isolate) U2-IR2 domain encoding nuclear protein EBNA2, complete cds


601589841F1 NIH_MGC_7 Homo sapiens cDNA clone IMAGE:3943954 5'


COLLAGEN ALPHA 1(VII) CHAIN PRECURSOR (LONG-CHAIN COLLAGEN) (LC COLLAGEN)


yy37h06.r1 Soares melanocyte 2NbHM Homo sapiens cDNA clone IMAGE:270587 5' similar to contains
element MER6 repetitive element ;


yy37h06.r1 Soares melanocyte 2NbHM Homo sapiens cDNA clone IMAGE:270587 5' similar to contains
element MER6 repetitive element ;


ab6Eg12.s1 Stratagene lung carcinoma 937218 Homo sapiens cDNA clone IMAGE:845734 3'


602068042F1 NIH_MGC_58 Homo sapiens cDNA clone IMAGE:4066907 5'


Homo sapiens T-cell lymphoma invasion and metastasis 1 (TIAM1), mRNA


Human gene for fourth somatostatin receptor subtype


Homo sapiens 959 kb contig between AML1 and CBR1 on chromosome 21q22, segment 2/3


Homo sapiens X28 region near ALD locus containing dual specificity phosphatase 9 (DUSP9), ribosomal
protein L18a (RPL18a), Ca2+/Calmodulin-dependent protein kinase I (CAMKI), creatine transporter (CRTR),
CDM protein (CDM), adrenoleukodystrophy protein >


601491081F1 NIH_MGC_S9 Homo sapiens cDNA clone IMAGE:3893276 5'


Homo sapiens prolactin-releasing peptide receptor gene, 5' flanking region


Homo sapiens partial steerin-1 gene


zkS7c09.s1 Soares_pregnant_uterus_NbHPU Homo sapiens cDNA clone IMAGE:490768 3' similar to
contains L1.t1 L1 repetitive element ;
Top Hit Descriptor


Rattus norvegicus plasma membrane Ca2+ ATPase isoform 3 (PMCA3) gene, 5' flanking region


CM3-LT0079-170200-092-«07 LT0079 Homo sapiens cDNA


Homo sapiens X28 region near ALD locus containing dual specificity phosphatase 9 (DUSP9), ribosomal
protein L18a (RPL18a), Ca2+/Calmodulin-dependent protein kinase I (CAMKI), creatine transporter (CRTR),
CDM protein (CDM), adrenoleukodystrophy protein >


Human class III alcohol dehydrogenase (ADH5) chi subunit mRNA, complete cds


Human class III alcohol dehydrogenase (ADH5) chi subunit mRNA, complete cds


Thermotoga neapolitana alpha-1,6-galactosidase (aglA) gene, complete cds


Thermotoga neapolitana alpha-1,6-galactosidase (aglA) gene, complete cds


BONE PROTEOGLYCAN II PRECURSOR (PG-S2) (DECORIN) (PG40) (DERMATAN SULFATE


am58c09.x1 Johnston frontal cortex Homo sapiens cDNA clone IMAGE:1539760 3'


Homo sapiens tubulin, beta, 4 (TUBB4) mRNA


tj01f11.x1 NCI_CGAP_Gas4 Homo sapiens cDNA clone IMAGE:2140269 3' similar to contains Alu repetitive


UI-H-BI1-adm-c-04-0-UI.s1 NCI_CGAP_Sub3 Homo sapiens cDNA clone IMAGE:2717190 3'


Homo sapiens DNA for amyloid precursor protein, complete cds


yx26c09.s1 Soares melanocyte 2NbHM Homo sapiens cDNA clone IMAGE:262864 3' similar to contains
L1.t1 L1 repetitive element ;


RETROVIRUS-RELATED POL POLYPROTEIN [CONTAINS: REVERSE TRANSCRIPTASE ;
ENDONUCLEASE]


UI-H-BI0-aab-e-GB-C-UI.s1 NCI_CGAP_Sub1 Homo sapiens cDNA clone IMAGE:27D8825 3'


Anguilla anguilla dopamine D1A1 receptor (d1A1) gene, complete cds


Kaposi's sarcoma-associated herpesvirus ORF 68 gene, partial cds; and ORF 69, kaposin, v-FLIP, v-cyclin,
latent nuclear antigen, ORF K14, v-GPCR, putative phosphoribosylformylglycinamidine synthase, and LAMP


zi34b08.s1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA clone IMAGE
contains L1.t1 L1 repetitive element ;


Homo sapiens PP1200 mRNA, complete cds


LINE-1 LIKE PROTEIN contains L1.t2 L1 repetitive element ;


hq64d12.x1 NCI_CGAP_HN13 Homo sapiens cDNA clone IMAGE:3124151 3'


yb78b10.r1 Stratagene ovary (#937217) Homo sapiens cDNA clone IMAGE:77275 5'
repetitive element


Homo sapiens gene for alpha-1-microglobulin-bikunin, exons 1-5 (encoding alpha-1-rr
terminus.)


AU159412 THYRO1 Homo sapiens cDNA clone THYRO1001602 3'


LINE-1 REVERSE TRANSCRIPTASE HOMOLOG


60133S213F1 NIH_MGC_44 Homo sapiens cDNA clone IMAGE:3690314 5'


PAROTID SECRETORY PROTEIN PRECURSOR (PSP)


RC0-LT0001-2S1199-011-A03 LT0001 Homo sapiens cDNA


HOMEOBOX PROTEIN GOOSECOID


POL POLYPROTEIN [CONTAINS: PROTEASE ; REVERSE TRANSCRIPTASE ; E


wa04a03.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2297068 3' simila
MER30 repetitive element ;


HISTIDINE-RICH GLYCOPROTEIN PRECURSOR


KNOB-ASSOCIATED HISTIDINE-RICH PROTEIN PRECURSOR (KAHRP)


zp02e05.r1 Stratagene ovarian cancer (#937219) Homo sapiens cDNA clone IMAGE:


UI-H-BB-aky-g-05-0-UI.s1 NCI_CGAP_Sub5 Homo sapiens cDNA clone IMAGE:27S


Mus musculus gene for odorant receptor A16, complete cds


on34h01.s1 NCI_CGAP_Lu5 Homo sapiens cDNA clone IMAGE:1558609 3' similar t
element;


te51f05.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2090241 3'
Q13537 MER37 TRANSPOSABLE ELEMENT, COMPLETE CONSENSUS SEQUE


wj90b04.x1 NCI_CGAP_Lym12 Homo sapiens cDNA clone IMAGE:2410063 3'


Top Hit 

Database 


i 




| 


| 


1 


1 








iPROT [ 


I 


SWISSPROT | 




1 

w 

CO 


SWISSPROT 1 


2 


IPROT | 


O 


iUMAN | 


5 

:• 






| 


| 


| 












EST h 








ft 


:est <r 






SSI MS 


% 


i 








EST H 






Top Hit Accession
No.


1 


i 








BE047094.1


| 


X54815.1 




£ 




1 


1 






1 


1 


I 

s 


AV657555.1


1 


1 
| 


1 

1 


1 


i 
1 


< 


Most Similar
(Top) Hit
BLAST E
Value




3 


I 


s 


I 


3 




3.0E-06


3.0E-06


3 


3 


1 


3.0E-06


I 


3 






3 


2.0E-06


2.0E-06




3 


2.0E-06


3 


§ 


li 




?■ 














e 




d 






























ORF SEQ 
ID NO: 


1 




1 




| 29762 


1 


I 30454 


30550 1 


1 






1 








1 


l 


1 




1 


a 


1 




1 


32948| 


jgl 






I 


1 


1 


1 


§ 


1 


! 




1 


e 




8 


1 


1 


1 


I 


1 


1 


S 


1 




I 
























































Probe
SEQ ID
NO:


S 


i 


1 


8 


1 


I 


1 




i 


1 


s 




1 




1 


1 


1 


1 




1 


1 




s 


1 






Homo sapiens chromosome 21 segment HS21C018


wi81b08.x1 NCI_CGAP_Kid12 Homo sapiens cDNA clone IMAGE:2399703 3'


wi81b08.x1 NCI_CGAP_Kid12 Homo sapiens cDNA clone IMAGE:2399703 3'


PM1-BN0033-030300-003-e12 BN0083 Homo sapiens cDNA


Human microfibril-associated glycoprotein (MFAP2) gene, putative promoter region and alternatively spliced
untranslated exons


Homo sapiens Xq pseudoautosomal region; segment 1/2


Human polymorphic microsatellite DNA


Human IgK subgroup I germline gene, exons 1 and 2, V-region O18 allele


ni56b09.s1 NCI_CGAP_Ov2 Homo sapiens cDNA clone IMAGE:980825 similar to contains Alu repetitive
element; contains L1.t3 L1 repetitive element;


Human polymorphic microsatellite DNA




MR0-BN0115-020300-001-f11 BN0115 Homo sapiens cDNA


yd50f12.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:111695 5'


HYPOTHETICAL 63.8 KD PROTEIN IN GUT1-RIM1 INTERGENIC REGION PRECURSOR




yc14h09.s1 Stratagene lung (#937210) Homo sapiens cDNA clone IMAGE:80705 3' similar to similar to
gb:M62982 ARACHIDONATE 12-LIPOXYGENASE (HUMAN)


yc14h09.s1 Stratagene lung (#937210) Homo sapiens cDNA clone IMAGE:80705 3' similar to similar to
gb:M62982 ARACHIDONATE 12-LIPOXYGENASE (HUMAN)


PROTEIN-ARGININE DEIMINASE TYPE IV (PEPTIDYLARGININE DEIMINASE IV) (PAD-R4)
(PEPTIDYLARGININE DEIMINASE TYPE ALPHA)


|WNT-'4 PROTEIN PRECURSOR | 



|QV1-UM0035-200300-1 1 5-g02 UM0036 Homo sapiens cDNA 


tw28f11.x1 NCI_CGAP_Ov35 Homo sapiens cDNA clone IMAGE:2261037 3' similar to contains Alu
repetitive element; contains element MSR1 MSR1 repetitive element;




Homo sapiens chromosome 9 duplication of the T cell receptor beta locus and trypsinogen gene families


Rattus norvegicus mRNA for 45 kDa secretory protein, partial


Homo sapiens TRF2-interacting telomeric RAP1 protein (RAP1) mRNA, complete cds


Top Hit 
Database 




EST_HUMAN


EST_HUMAN


| 










EST_HUMAN




EST_HUMAN


| 

i 


1 


I 






| 


1 


SWISSPROT


O 

% 
% 


| 


| 


1 

a 


EST_HUMAN








1 
S 

I 




AL153218.2


1 
1 


1 

< 


1 


U19719.1


AJ271735.1








1 
I 


BE005077.1


BE005077.1


t 


l 


§ 
< 




1 
& 






1 

3 


I 


§ 
1 


1 
1 


i 

I 


i 

£ 




< 


Most Similar 
(Top) Hit 
BLAST E 


4.0E-07


§ 


1 


! 


9 
S 


3.0E-07


9 

s 


9 


3.0E-07


3.0E-07


3.0E-07


3.0E-07


3.0E-07


9 
ri 


3 


3.0E-07


9 


3.0E-07 


3.0E-07


3.0E-07


3.0E-07


3.0E-07


1 


1 


1 


9 


i 


.§ 
V, 
S> 

£ 


| 


















9 






8 




?j 


s 










§ 








! 


q 






ORF SEQ 


Q 


| 37128 








s 

8 


8 


1 27387: 






l 


1 






¥ 

R 










320531 


1 




8 










I 


Jog 


I 23700 






24574, 


8 




! 


! 


S 

s 




i 


I 


a 


S 

S 




s 


§ 


8 


1 


£8 
8 


Si 


s 

I 








§ 




































SEQ ID 

NO: 


10814


11376


11376


11670


1 


1 




1 


I 




1 


f, 


1 


1 


i 


1 




8 
5 


1 


1 


1 


I 




1 


s 
§ 


8 
8 








Top Hit Descriptor 


Homo sapiens DiGeorge syndrome critical region, telomeric end


Homo sapiens DiGeorge syndrome critical region, telomeric end


Fugu rubripes beta-cytoplasmic(vascular) actin gene, complete cds 


Homo sapiens homeobox protein CDX4 (CDX4) gene, complete cds and flanking repeat regions


Homo sapiens homeobox protein CDX4 (CDX4) gene, complete cds and flanking repeat regions


RETROVIRUS-RELATED POL POLYPROTEIN [CONTAINS: REVERSE TRANSCRIPTASE;
ENDONUCLEASE]


Zrf)8b07.s1 Stratagene NT2 neuronal precursor 937230 Homo sapiens cDNA clone IMAGE:650869 3' similar 
to gb:L31860 GLYCOPHORIN A PRECURSOR (HUMAN);contains Alu repetitive element; 


yc15g04.s1 Stratagene lung (#937210) Homo sapiens cDNA clone IMAGE:80790 3' similar to contains L1




HYPOTHETICAL 72.5 KD PROTEIN C2F7.10 IN CHROMOSOME I


601818916F1 NIH_MGC_58 Homo sapiens cDNA clone IMAGE:4044891 5'


Homo sapiens caveolin 1 (CAV1) gene, exon 3 and partial cds




nm33a0S.s1 NCI_CGAP_Llp2 Homo sapiens cDNA clone IMAGE:1061938 similar to contains Alu repetitive I 
element; ! 


AV729390 HTC Homo sapiens cDNA clone HTCAEG02 5'


zk27g09.s1 Soares_pregnant_uterus_NbHPU Homo sapiens cDNA clone IMAGE:471808 3'




CM4-NN0003-280300-' 24-e06 NN0003 Homo sapiens cDNA j 


COMPLEMENT FACTOR B PRECURSOR (C3/C5 CONVERTASE) (PROPERDIN FACTOR B)
(GLYCINE-RICH BETA GLYCOPROTEIN) (GBG) (PBF2)


COMPLEMENT FACTOR B PRECURSOR (C3/C5 CONVERTASE) (PROPERDIN FACTOR B)
(GLYCINE-RICH BETA GLYCOPROTEIN) (GBG) (PBF2)


PM0-HT0339-260100-005-H07 HT0339 Homo sapiens cDNA


zn85h11.x5 Stratagene lung carcinoma 937218 Homo sapiens cDNA clone IMAGE:565029 3' similar to
contains THR.b2 THR repetitive element;














1 


| 


5 




5PROT | 


| 




II 


?! 


| 


3 
3 


| 


1 




lUMAN | 




1 




iUMAN | 




£|£ 












a 
% 


Sb 




1 


i 


i 




















1 


1 




a 




Top Hit Accession


L77569.1


L77569.1


U38849.1


AF003530.1


S 
£ 
< 


1 


i 


T63042.1


! 

n 

a 


IQDG701 I 


K 


1 


In 

1 ! 


ii 
ii 


AI208715.1


l 


§ 


AA035198.1


I 


s 

1 


i 


P00751


1 


I 

< 


1 


Most Similar 
(Top) Hit 
BLAST E 
Value 


2.0E-07


2.0E-07


2.0E-07


2.0E-07


9 

a 


3 


9 


2.0E-07


9 
9 


s 

9 


9 
S 


9 


9 « 
g« 


= 9 


9 


9 

B 


9 


2.0E-07| 


9 


2.0E-07| 


9 


3 


3 
9 


9 

Uj 


1.0E-07


Expression 


s 




1 






1 


s 




W 




8 




K c 












0 






s 


8 






















° 








O T 


0 










*" 














ORF SEQ
ID NO: 




8 




i 








1 


8 






& 








1 








§ 




1 








SEQ ID 
NO: 


1 


1 












§ 








I 






1 


1 


1 






1 

S3 


1 










SEQ ID
NO: 


s 


S 




Pi 










1 


1 


i 












8 


8 


1 


8 


SB 


s 


1 


































1 


1 




Si 






§ 




1 


















§ 

E 
1 


cn1 5c02 random I 


































3 contains 
'TASE ; 






tains Alu repetitive | 


















HTBC 


































lone IMAGE:1335368 3' similar ti 
TAINS: REVERSE TRANSCRIF 






8 
S 






i 




DNA clone IMAGE:2328: 


1 

1 

1 


lo 

1 
1 


s 

■1 

Si 


1 
| 

1 




gene, complete cds 






















| 












.ne IMAGE:943193 stmll< 








1 

I 

1 
t 
j 


g 

! 
| 

I 

I 
1 


601580133F1 NIH_MGC_7 Homo sapiens cDNA clone


1 

I 

1 

i 

1 
1 


1 
1 

f 

1 
I 


|cn15c02.x1 Normal Human Trabecular Bone Cells Horn 


]EST382776 MAGE resequences, MAGK Homo sapien: 


I 

i 

! 

1 

! 

| 


ANKYRIN 1 (ERYTHROCYTE ANKYRIN)


Rat mRNA for ribosomal protein L31


DYNEIN HEAVY CHAIN (DYHC)


DYNEIN HEAVY CHAIN (DYHC)


cong3.P11.A5 conorm Homo sapiens cDNA 3'


Rattus norvegicus Munc13-1 mRNA, complete cds


DYNEIN HEAVY CHAIN (DYHC)


DYNEIN HEAVY CHAIN (DYHC)


Homo sapiens chromosome 21 segment HS21C048


Homo sapiens chromosome 21 segment HS21C048




Homo sapiens KIAA1074 protein (KIAA1074), mRNA


Homo sapiens chromosome 21 segment HS21C048


LINE-1 REVERSE TRANSCRIPTASE HOMOLOG


ob5Sc05.s1 NCI_CGAP_GCB1 Homo sapiens cDNA cl 
MER12.b3 MER12 repetitive element ; 
RETROVIRUS-RELATED POL POLYPROTEIN [CON 


ENDONUCLEASE] 

Homo sapiens chromosome 21 segment HS21C009


Homo sapiens chromosome 21 segment HS21C103 


nh03b09.s1 NCI_CGAP_Thy1 Homo sapiens cDNA clc 


COMPLEMENT C2 PRECURSOR (C3/C5 CONVERT 


QV0-CT0225-131099-034-a12 CT0225 Homo sapiens


fiji 




§ 


I 


| 


| 


1 


l 




JPROT 




JPROT 


s 






SWISSPROT i 


S 






1 






SPROT | 


lUMAN 


I 




| 




| 












S3 


I 


a 




S2 




3 

s 


1 
% 






1 












52 


jjj 


SWISJ 
NT 






52 

1 


ESTJ- 


Top Hit Accession


AJ251973.1


1 


i 

1 


1 


1 
< 


1 
• < 


1 
1 


AF253417.1


Q02357


1 

s 


P15305


P15305


AI535743.1


1 


I 






AL163248.2






AL163248.2


P08547


AA827075.1


P11369
AL163209.2


si 

i 


1 


1 


AW85187S.1 | 


Most Similar 
(Top) Hit 
BLAST E 
Value 


9.0E-08




1 


3 


— 


» 


1 


8.0E-08


9 


9 


7.0E-08


9 


9 


9 


9 


3 


3 


3 


f 


6.0E-08


6.0E-08


9 


? 
§ 


33 
§ § 


3 


9 


9 


5.0E-08


ii 

r 


































9 




~ 
















t 


s 


ORF SEQ 
ID NO: 










1 




1 




§ 


1 


° 


1 






s 

s 


1 


1 


1 


1 


|9Q06S 


l 






3 
5 


I 


s 
s 




1 


SEQ ID 
NO: 


8 


1 


a 


1 






I 

n 


1 




1 








1 




1 


1 


1 


1 






1 


| 


11 


1 


I 


s 
s 




SEQ ID 

NO: 


1 


i 


1 


1 


1 


I 


s 


1 


s 




s 


1 


B 


I 


1 


I 






E 
8 




5? 




! 


1 1 




8 


K 

a 


1 












I 


;lonelMAGE:1674458 3' similar to I 








f 




1 
1 

o 

s 


)9411 3' similar Id oontalns Alu | 


17 5' similar to TR:G505579 


5' similar to TRG505579 ( 






1AGE:345556 5' similar to contains 


1 

I 
§ 
B 

1 


]■ similar to TR.Q9Z158 Q9Z158 






one IMAGE2126273 3' similar to 
! CONSENSUS SEQUENCE. ; 




871 95 3' similar to gb:M34079 TAt| 


Top Hit Descriptor


DORSAL-VENTRAL PATTERNING TOLLOID PROTEIN PRECURSOR


DORSAL-VENTRAL PATTERNING TOLLOID PROTEIN PRECURSOR


|DKFZp434J042S_r1 434 (synonym: htes3) Homo sapiens cDNA clone D: 


oz05e02.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
contains Alu repetitive element;


Homo sapiens shox gene, alternatively spliced products, complete cds


URIDINE PHOSPHORYLASE (UDRPASE)


TRANSMEMBRANE PROTEASE, SERINE 2


Cricetulus griseus ribosomal transcription factor (UBF2) mRNA, complete


LINE-1 REVERSE TRANSCRIPTASE HOMOLOG



Ian22d10.x1 Gessler Wilms tumor Homo sapiens cDNA clone IMAGE:16£ 
repetitive element; contains element MER22 repetitive element;




602248024F1 NIH_MGC_62 Homo sapiens cDNA clone IMAGE:433330


602248024F1 NIH_MGC_62 Homo sapiens cDNA clone IMAGE:433330


zd35g03.r1 Soares_fetal_heart_NbHH19W Homo sapiens cDNA clone IN
L1.t1 L1 repetitive element;


tb95a11.x1 NCI_CGAP_Co16 Homo sapiens cDNA clone IMAGE:20620
MER18 MER18 repetitive element;


bb79a10.y1 NIH_MGC_10 Homo sapiens cDNA clone IMAGE:3048570 £ 
SYNTAXIN 17. ; 


qs7Sfl 1 ,y5 NCI_CGAP_Pr28 Homo sapiens cDNA clone 1MAGE:194404 


Homo sapiens chromosome 21 segment HS21C046


th93h09.x1 Soares_NSF_F8_9W_OT_PA_P_S1 Homo sapiens cDNA cl
TR:Q13537 Q13537 MER37 TRANSPOSABLE ELEMENT COMPLETE


Homo sapiens MHC class I region


yp12b10.s1 Soares breast 3NbHBst Homo sapiens cDNA clone IMAGE:1 
BINDING PROTEIN-1 (HUMAN); 


Sis 


SPROT 


SPROT 


HUMAN 


| 




SPROT 






O 
I 


HUMAN 


HUMAN 


HUMAN 


HUMAN 


| 




HUMAN | 


HUMAN | 


HUMAN | 






| 








f 


f 








|Sft!S 


i 




I 






























Top Hit Accession


P25723


P25723


1 


i 

2 


1 


I 


o 




P08547




AI050027.1 


1 
I 


S 
S 

1 






i 


AI343353.1


1 
1 


AI792737.1


i 
1 


AI436352.1 


AF055066.1


i 

i 


Most Similar 
(Top) Hit 
BLAST E 
Value 


2 


2 


9 


9 


9 


9 


9 


9 


4.0E-08


4.0E-08


4.0E-08


9 


9 


9 


9 


4.0E-08


4.0E-08


s 


3.0E-08


9 

ri 


3.0E-08


3.0E-08


3.0E-08


Expression 
Signal 


a 




1 










«! 


1 
























% 




s? 


ORF SEQ
ID NO: 










I 


1 




1 










1 


1 


I 






1 




1 








SEQ ID 

NO: 


1 


| 


1 








a 


Si 


1 


B 


8 


1 


1 


1 


1 


§ 


25476| 


s 


I 


1 


Si 
S3 




1 


Probe 
SEQ ID 
NO: 






1 






I 


8 
S 














| 11533| 






B 
R 


1 








| 


8 

5 














1 


1 
1 










tOTEIN)(CTP) | 












S3 

1 


1 
1 






1 

I 




1 




3 




CLEASE] 






i 


I 




I 

o 






MSPORTPF 












S 
& 
E 


| 






f 

1 




a 

1 c 




1 


Top Hit Descriptor 


|POL POLYPROTEIN [CONTAINS: REVERSE TRANSCRIPTASE ; ENDONU 


Homo sapiens caveolin 1 (CAV1) gene, exon 3 and partial cds


PM2-HT0130-150S99-001-f12 HT0130 Homo sapiens cDNA


TCBAP1D5232 Pediatric pre-B cell acute lymphoblastic leukemia Baylor-HGSC
sapiens cDNA clone TCBAP5232


TCBAP1D5232 Pediatric pre-B cell acute lymphoblastic leukemia Baylor-HGSC
sapiens cDNA clone TCBAP5232


? 

5 

i 

1 


|52 KD RO PROTEIN (SJOGREN SYNDROME TYPE A ANTIGEN (SS-A)) (R 


ot35a05.s1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:1618736 3'


! 

! 
§ 

o 

1 
1 


TRICARBOXYLATE TRANSPORT PROTEIN PRECURSOR (CITRATE TRAI
(TRICARBOXYLATE CARRIER PROTEIN)


BONE MORPHOGENETIC PROTEIN 1 PRECURSOR (BMP-1)


Homo sapiens major histocompatibility locus class III region


Human lambda-immunoglobulin constant region complex (germline)


Homo sapiens chromosome 21 segment HS21C079 




qu86d 1 xl NCLCGAP_Gas4 Homo sapiens cDNA clone lMAGE:19789e4 3' : 
repetitive element ; 


qd42e07.x1 Soares_fetal_heart_NbHH19W Homo sapiens cDNA clone IMAGE
contains MSR1.t1 MSR1 repetitive element;


< 
1 
I 
| 

| 

i 
i 
1 


!op74d08.s1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:1 582 


Homo sapiens DNA for 3-ketoacyl-CoA thiolase beta-subunit of mitochondrial tri


Human familial Alzheimer's disease (STM2) gene, complete cds


zr80c05.r1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:681992 5'
repetitive element;


6311 1 1 173F1 NIH_MGC_16 Homo sapiens cDNA clone IMAGE-.3351834 5' 


!z?58e07.s1 Scares retina N2b4HR Homo sapiens cDNA clone IMAGE:381156 
[repetitive element ; 


111 


3SPROT | 




HUMAN I 


1 


5 
1 




5SPROT | 


| 


1 




& 
1 








§ 


HUMAN | 


HUMAN | 


1 


| 








HUMAN | 


HUMAN | 




TO 




a 


E2 


ES 




1 


a 


a 


1 


1 














a 


a 






a^ 


Eft 


a 


f 

8 n 


1 


1 


s 


1 
1 


1 


AJ010770.1




AI015304.1




! 


8 
1 


1 


S 

5. 


ii 




AI270615.1


1 


i 
§ 
1 


1 
1 




1 


AA256200.1


n 


1 
I 


1 £ i- a 


1.0E-08


1 


3 


3 


S 


3 


1.0E-08


9 


3 


3 


I 


1.0E-08


9 

q L 


1 1 


3.0E-0S| 


? 

s 


3 


3 


1 


3 


9 


9< 
2^ 


! 3 


3 


ill* 


















































Expression 




8 


s 






s 






s 
° 








» 










CM 










8 S 




ORF SEQ
ID NO: 




1 




3 


8 






1 


1 


s 


1 






SB 








1 
I 








j 


II 




ih 


1 


1 


8 
8 


1 


| 


e 
1 






1 


I 




1 




!| 


s 
a 


1 


20574) 




SI 


8 




M 


18 

3 S3 


2371 9) 


1 1 ° 


g 


1 


1 


1 


1 




1 










1 


1 




s 




I 


B 






1 


il 


11 


1 






















| 






8 


& 




S 
































CM' 


























! 












s 




S 


1 










1 


























zx63h06.r1 Soares_total_fetus_Nb2HF8_9w Homo sapiens cDNA clone IMAGE:796187 5' similar to c
Alu repetitive element;
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hu09e09.x1 NCI_CGAP_Lu24 Homo sapiens cDNA clone IMAGE:31661 20 3' similar to contain: 
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-Homo sapiens eukaryotic initiation factor 4AI (EIF4A1) gene, partial cds 


258.1 KDA PROTEIN C21ORF5 (KIAA0933)
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Top Hit Descriptor 


7o78d08.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:3642303 3' similar to contains L1.t3 L1 I 
repetitive element ; I 
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AV652123 GLC Homo sapiens cDNA clone GLCCXA1 1 3' j 


1 QV0-CT0225-1 91 1 99-058-eOS CT0225 Homo sapiens cDNA | 


QV2-TT0003-161199-013-g10 TT0003 Homo sapiens cDNA


DKFZp434N1317_r1 434 (synonym: htes3) Homo sapiens cDNA clone DKFZp434N1317 5'


DKFZp434N1317_r1 434 (synonym: htes3) Homo sapiens cDNA clone DKFZp434N1317 5'


Homo sapiens nuclear factor of kappa light polypeptide gene enhancer in B-cells 1 (NFKB1) gene, complete


Homo sapiens X28 region near ALD locus containing dual specificity phosphatase 9 (DUSP9), ribosomal
protein L18a (RPL18a), Ca2+/Calmodulin-dependent protein kinase I (CAMKI), creatine transporter (CRTR),
CDM protein (CDM), adrenoleukodystrophy protein >

Homo sapiens X28 region near ALD locus containing dual specificity phosphatase 9 (DUSP9), ribosomal
protein L18a (RPL18a), Ca2+/Calmodulin-dependent protein kinase I (CAMKI), creatine transporter (CRTR),
CDM protein (CDM), adrenoleukodystrophy protein >

Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds


Human pregnancy-specific glycoprotein beta-1 (SP1) mRNA, last exon
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MER31 ,t1 MER31 repetitive element ; j 
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Top Hit Descriptor 


PREGNANCY ZONE PROTEIN PRECURSOR


Homo sapiens chromosome 21 segment HS21C100


Homo sapiens chromosome 21 segment HS21C100


IL5-BT0578-130300-03B-G12 BT0578 Homo sapiens cDNA


Homo sapiens Xq pseudoautosomal region; segments


34 KD SPICULE MATRIX PROTEIN PRECURSOR (LSM34)


zj23g01.s1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA clone IMAGE:451152


AV730554 HTF Homo sapiens cDNA clone HTFAWF06 5'


nz88C1 1 s1 NCI_CGAP_GCB1 Homo sapiens cDNA clone IMAGE:1302573 3' similar to conta 
repetitive element; 


Homo sapiens FRA3B common fragile region, diadenosine triphosphate hydrolase (FHIT) gene 
Morone saxatilis myosin heavy chain FM3A (FM3A) mRNA, complete cds
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1 lomo sapiens chromosome 21 segment HS21 C078 
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Top Hit Descriptor 


zt77a12.s1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:728350 3' similar to contains Alu
repetitive element; contains element MER22 repetitive element;


jGAP JUNCTION BETA-1 PROTEIN (CONNEXIN 30) (CX30) ( 
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A32995 1 complex sterility protein - mouse ; | 
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qn32d05.x1 NCI_CGAP_Kid5 Homo sapiens cDNA clone 1MAGE:1899945 3' similar to contains Alu 
repetitive element I 
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Homo sapiens X-linked anhidroitic ectodermal dysplasia protein gene (EDA), exon 2 and flanking repeat 1 
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!FGF-1=fibroe]ast growth factor 1 [human, kidney, Genomic, 342 nt, segment 2 of 2] 


Homo sapiens LGMD2B gene


H.sapiens DMA, DMB, HLA-Z1, IPP2, LMP2, TAP1, LMP7, TAP2, DOB, DQB2 and RING8, 9, 


inw21 g02.s1 NCI_CGAP_GCB0 Homo sapiens cDNA clone IMAGE:1241 138 3' similar to contaii 
THR repetitive element ; 


602038009F1 NCI_CGAP_Brn64 Homo sapiens cDNA clone IMAGE:4185866 S 


yl 535. seq.F Human fetal heart, Lambda ZAP Express Homo sapiens cDNA 5' 


nn24d01.s1 NCI_CGAP_Gas1 Homo sapiens cDNA clone IMAGE:1084801 3' similar to contain! 
repetitive element,conta'ns element MEF124 repetitive element ; 


nn24d01.s1 NCLCGAP_Gas1 Homo sapiens cDNA clone IMAGE:1 084801 3' similar to contain 
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repetitive element ; 




Homo sapiens mRNA for sodium-glucose cotransporter (SGLT2 gene) 


Homo sapiens mRNA for sodium-glucose cotransporter (SGLT2 gene) 


Homo sapiens TFF gene cluster for trefoil factor, complete cds 


xo54h05.x1 NCI_CGAP_Ut1 Homo sapiens cDNA clone IMAGE:2707833 3' 


aj24c01.s1 Soares_testis_NHT Homo sapiens cDNA clone 1391232 3' similar to contains MER1
repetitive element ; 


Human DNA, SINE repetitive element 


Saguinus oedipus gene for seminal vesicle secreted protein semenogelin 1 




yi72e03.r1 Soares placenta Nb2HP Homo sapiens cDNA clone IMAGE:144796 3'


III 


EST_HUMAN








EST_HUMAN


EST_HUMAN


EST_HUMAN






SWISSPROT




EST_HUMAN


< 




3 


i 


EST_HUMAN








? 


EST_HUMAN




z 


EST_HUMAN


| 


1 

8 






















































I 


l 




AJ007973.1


I 

s 


I 


S 

£ 


1 


s 

!S 


1 


I 

o 


I 


§ 

LL 


B 
1 


1 


i 
1 




1 

1 


si 


5 


I 
1 






D14547.1


i 
§ 














V 














































IIP 






























§ 














1 








§ 
























































.1 




™ 


s 


s 




°> 






s 


























5 










8 g> 






















































ORF SEQ
ID NO: 






I 


s 


§ 


| 30572| 


s 


s 






I 


8 






8 


s 
R 




1 




8 


1 


1 


I 


1 






IP 


8 

a 


I 


s 


1 


8 


17709 


f 


1 


8 


1 


1 






I 


1 


I 


1 


1 


§ 


1 


1 


? 


1 


1 




I 






















































Probe
SEQ ID


8 






1 13641 


1 




| 6713! 


1 


I 


1 


| 10790| 




| 12292| 


I 






1 


B 


8 


1 




s 


1 


1 


1 


1 






Top Hit Descriptor 


H. sapiens DNA for endogenous retroviral like element


Izq1 7c13.s1 Stratagene fetal retina 937202 Homo sapiens cDNA clone IMAGE.-62SS7D 3' j 


wc92hC8.x1 NCI_CGAP_Co3 Homo sapiens cDNA clone IMAGE:2326143 3' | 


xfS7e1 0.X1 NCI_CGAP_Gas4 Homo sapiens cDNA clone IMAG&2623146 3' similar to contains MER10.t2 I 
||VIER10 repetitive element ; j 


Homo sapiens chromosome 21 segment HS21C085


Homo sapiens FRA3B common fragile region, diadenosine triphosphate hydrolase (FHIT) gene, exon 5


Homo sapiens FRA3B common fragile region, diadenosine triphosphate hydrolase (FHIT) gene, exon 5


Homo sapiens FRA3B common fragile region, diadenosine triphosphate hydrolase (FHIT) gene, exon 5


CANALICULAR MULTISPECIFIC ORGANIC ANION TRANSPORTER 1 (MULTIDRUG RESISTANCE-
ASSOCIATED PROTEIN 2) (CANALICULAR MULTIDRUG RESISTANCE PROTEIN)


xb03b05.x1 NCI_CGAP_GU1 Homo sapiens cDNA clone IMAGE:2575185 3' similar to contains L1.t2 L1
repetitive element;


LINE-1 REVERSE TRANSCRIPTASE HOMOLOG


S-ANTIGEN PROTEIN PRECURSOR
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Top Hit Descriptor 


'CM4-PT0n34-180200-£06-a01 PT0034 Homo sapiens cDNA | 


CM4-PT0034-1 802G0-e06-a01 PT0034 Homo sapiens cDNA | 


Homo sapiens pituitary tumor transforming gene protein (PTTG) gene, complete cds


af3Sg1 1 .s1 Soares_total_fetus_Nb2HF8_9w Homo sapiens cDNA clone IMAGE:1034084 3' similar to 
'contains OFR.t2 OFR repetitive element ; | 


QV0-BN0148-070700-293-a10 BN0148 Homo sapiens cDNA


Homo sapiens SNCA isoform (SNCA) gene, complete cds, alternatively spliced


Homo sapiens CCR8 chemokine receptor (CMKBR8) gene, complete cds


MITOGEN-ACTIVATED PROTEIN KINASE KINASE KINASE 10 (MIXED LINEAGE KINASE 2) (PROTEIN
KINASE MST)


Homo sapiens CCR8 chemokine receptor (CMKBR8) gene, complete cds


QV2-PT0012-040400-124-e06 PT0012 Homo sapiens cDNA
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tg22c1 1 .x1 NCI CGAP_CLL1 Homo sapiens cDNA clone IMAGE:2109524 3' similar to contains MER28.t2 I 
MER28 repetitive element;


xg49g12.x1 NCI_CGAPJJt1 Homo sapiens cDNA clone IMAGE2S30950 3 1 similar to contains OFR.t2 OFR j 


Homo sapiens pituitary tumor transforming gene protein (PTTG) gene, complete cds
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;AV730759 HTF Homo sapiens cDNA done HTFAQB07 5' | 


Mus musculus dynein, axon, heavy chain 11 (Dnahc11), mRNA


Mus musculus apolipoprotein B editing complex 2 (Apobec2), mRNA


Homo sapiens putative MTAP (MTAP) mRNA, partial cds, alternatively spliced


Mus musculus WNT-2 gene, partial cds; putative ankyrin-related protein and cystic fibrosis transmembrane
conductance regulator (CFTR) genes, section 1 of 2 of the complete cds; and unknown gene


RC1-HN0003-220300-021-b04 HN0003 Homo sapiens cDNA


hi81d04.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2978695 3' similar to contains L1.t2
L1 repetitive element;


jyc05hOS.r1 Stratagene lung (#93 721 0) Homo sapiens cDNA clone IMAGE:79839 5' ! 
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jnlQ6e05.s1 NCI_CGAP_Co10 Homo sapiens cDNA clone IMAGE:1058528 3' | 


Top Hit 
Database 


EST_HUMAN


EST_HUMAN




| 


S 


NT 


NT 


SWISSPROT


NT 


1 




EST_HUMAN






EST_HUMAN


NT 


a 


EST_HUMAN










| 


EST_HUMAN


| 


h 


EST_HUMAN


Top Hit Accession


5 
1 


1 


AF200719.1


I 

5 


I 
1 


t 
1 


U45983.1


1 
§ 




I 


? 

8 

8 
1 


| 


1 
1 




]AW880701 .1 | 


]AL1 S3280.2 | 


BE172081.1


AV730759.1


| 




I 




I 


IaW6S2772.1 I 




1 


■i 


Most Similar 
(Top) Hit 
BLAST E 
Value 
























9.0E-17 










S.OE-171 


S.0E-17| 


LU 

s 


LU 








6.0E-17




5.0E-17


4.0E-17


fl 

r 








8 






s 


3.03 








































ORF SEQ
ID NO: 


s 


1 


I 










331 02ij 






8 












31951 | 




| 






1 


1 


32794I 




34255| 


§ 


Exon
SEQ ID


R 


R 


1 




1 


8 
8 




1 


1 


l 








1 


1 




I 










1 


1 


I 








SEQ ID

NO: 


1 


1 


8 


1 


1 


1 


I 6703! 


i 


| 7985 [ 


1 


I 


s 


I 


1 10720[ 


| 1045[ 


1 


| 5775 | 




| 8210[ 






1 




s 

8 


1 


8 








Top Hit Descriptor 


OLFACTORY RECEPTOR 5 (M50)


Homo sapiens Xq pseudoautosomal region; segment 1/2


DKFZp762F192_r1 762 (synonym: hmel2) Homo sapiens cDNA clone DKFZp762F192 5'


ZONA PELLUCIDA SPERM-BINDING PROTEIN B PRECURSOR (ZONA PELLUCIDA GLYCOPROTEIN
ZP-X) (RC55)
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1 
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1 
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IHomo sapiens chromosome 21 segment HS21C009 | 


Homo sapiens partial IL-12RB1 gene for IL-12 receptor beta1 chain, exon 14


xj87b02.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2664171 3' similar to contains
element MSR1 repetitive element;


Human germline T-cell receptor beta chain — CRBV13S1, TCRBV6S8A2T, TCRBV5S6A3N2T, 
TCRBV13S6A2T, TCRBV6S9P, TCRBV5S3A2T, TCRBV13S8P, TCRBV6S3A1N1T. TCRBV5S2, 
TCRBV6S5A2T, TCRBV5S7P, TCRBV1 3S4, TCRBVSS2A1 N1 T, TCRBV5S4A2T, TCRBV6S4A1 , 
TCRBV23S1 A2T, TCRBV1 2> 


Homo sapiens mRNA, chromosome 1 specific transcript KIAA0501


602130910F1 NIH_MGC_55 Homo sapiens cDNA clone IMAGE:4287574 5'


Homo sapiens mannosidase, beta A, lysosomal (MANBA) gene, and ubiquitin-conjugating enzyme E2D 3
(UBE2D3) genes, complete cds


BETA-2 ADRENERGIC RECEPTOR


BETA-2 ADRENERGIC RECEPTOR


AV708136 ADC Homo sapiens cDNA clone ADCAMA11 5'


Homo sapiens NPD008 protein (NPD008) mRNA, complete cds


Homo sapiens similar to aldo-keto reductase family 1, member B11 (aldose reductase-like) (H. sapiens)
(LOC63222), mRNA


M.musculus mRNA for TPCR33 protein


Homo sapiens phorbolin 1 protein (PBI) mRNA, complete cds


Homo sapiens chromosome 21 segment HS21C001


|qo91e02.x1 NCI_CGAP_Kid5 Homo sapiens cDNA clone IMAGE:1915898 3' similar to TR.Q59386 Q69386 I 
|P0UENVGENE; | 


AV731382 HTF Homo sapiens cDNA clone HTFAZC05 5'


Mus musculus keratin-associated protein 9-1 (Krtap9-1), mRNA


ze34c09.r1 Soares retina N2b4HR Homo sapiens cDNA clone IMAGE:360880 5'
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I op Mit Descriptor 


ZONADHESIN PRECURSOR


EST180326 Liver ll: Homo sapiens cDNA 5' end | 


Homo sapiens RGH1 gene, retrovirus-like element


Homo sapiens RGH1 gene, retrovirus-like element


CHR2203- 0 Chromosome 22 exon Homo sapiens cDNA clone C22_391 5'
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hr84b06.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:31 351 55 3' similar to contains Ll.t2 L1 I 
repetitive element;


AF04E567 Human activated dendritic cell mRNA Homo sapiens cDNA clone GA05 j 


Homo sapiens Autosomal Highly Conserved Protein (AHCP), mRNA 


Homo sapiens calcium channel alpha1 E subunit (CACNA1E) gene, exons 7-49, and partial cds, alternatively
spliced


r>c60gD8.r1 NCI_CGAP_Pr1 Homo sapiens cDNA clone IMAGE:745694 similar to contains L1 .t3 L1 I 

| repetitive element ; j 


AJ003514 Selected chromosome 21 cDNA library Homo sapiens cDNA clone MPIpl12-8.l21 j 
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Top Hit Descriptor 


TENASCIN-X PRECURSOR (TN-X) (HEXABRACHION-LIKE)


TENASCIN-X PRECURSOR (TN-X) (HEXABRACHION-LIKE)


qs73f11.x1 NCI_CGAP_Pr28 Homo sapiens cDNA clone IMAGE:1943757 3' similar to TR:Q13537 Q13537
MER37 TRANSPOSABLE ELEMENT, COMPLETE CONSENSUS SEQUENCE;


MR3-HT0487-150200-113-g01 HT0487 Homo sapiens cDNA
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yr16a02.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:205418 5' 


Homo sapiens cytochrome P450 polypeptide 43 (CYP3A43) gene, partial cds; cytochrome P450 polypeptide
4 (CYP3A4) and cytochrome P450 polypeptide 7 (CYP3A7) genes, complete cds; and cytochrome P450
polypeptide 5 (CYP3A5) gene, partial cds


Homo sapiens chromosome 21 segment HS21C103


Homo sapiens T cell receptor beta locus, TCRBV7S3A2 to TCRBV12S2 region
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zw82c06.r1 Scares Jestis_NHT Homo sapiens cDNA clone 1MAGE:782S98 5' similar to contains PTR5.t2 I 
PTR5 repetitive element ; | 


.601301762F1 NIH_MGC_21 Homo sapiens cDNA clone IMAGE:3636254 5' I 
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ab75a08.s1 Stratagene fetal retina 937202 Homo sapiens cDNA done !MAGE:85275B 3' similar to I 
TR:E19822 E19822 CA PROTEIN. ; [ 
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Top Hit Descriptor 


hk01b10.s1 NC1_CGAP_PM1 Homo sapiens cDNA clone IMAGE:1000699 similar to gb:M1788S 60S j 
ACIDIC RIBOSOMAL PROTEIN P1 (HUMAN); ! 
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Homo sapiens DNA, DLEC1 to ORCTL4 gene region, section 1/2 (DLEC1 , ORCTL3, ORCTL4 genes, I 
(compete cds) | 


iht09g01.x1 NCI_CGAP_Kid13 Homo sapiens cDNA clone IMAGE:3146256 3' similar to contains MER29.03 
MER29 repetitive element; 


jHomo sapiens Retina-derived POU-domain fa=tor-1 (RPF-1), mRNA j 


!hSPD204B1 HM3 Homo sapiens cDNA clone s4000095C10 | 


:Homc sapiens mRNA for KIAA0454 protein, psrtial cds | 
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Lv1 7c1 1 .x1 NCl_CGAP_Lu24 Homo sapiens cDNA clone IMAGE:3183188 3' similar :o TR:Q0 /314 Q0731 4 
SECRETED NEUREXIN lll-ALPHA-C PRECURSOR. [3] T=?:Q07280 TR:Q07313 ; : 


:AU126260 NT2RP1 Homo sapiens cDNA clone NT2RP1000443 5' 


au83ti08.x1 Schneider fetal brain 00004 Homo sapiens cDN A clone IMAGE:278291 1 3' similar to 
TR:O30302 O30302 KIAA0555 PROTEIN, contains element MER22 repetitive element ; [ 
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Homo sapiens gamn glutam\ttransfera -III s <-tivity"1 (GGTLA1), mRNA j 


Homo sapiens zinc finger protein ZNF1 91 (ZNF191) gene, complete cds ] 


aaS0e03.r1 NCI_CGAP_GCB1 Homo sapiens cDNA clone IMAGE:825340 5' similar to contains Alu 
repetitive element;contains element PTR5 repetitive element ; j 


wo18c07.x1 NCl_CGAP_Pan1 Homo sapiens cDNA clone IMAGE:2455692 3' similar to contains THR.M I 
THR repetitive element; 


xn33c09.x1 NCI_CGAP_Kid1 1 Homo sapiens cDNA clone IMAGE:2595504 3' similar to SW:GG95 HUMAN 
0.08379 GOLGIN-95. ; | 


qfS6f10.x1 Soares.Jestls_NHT Homo sapiens cDNA clone IMAGE:1755019 3' similar to gb:M19503 LINE-1 I 
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EST378521 MAGE resequences, MAGI Homo sapiens cDNA \ 




Rattus norvegicus mRNA for 45kDa secretory protein, partial
wp69b01.x1 NCI_CGAP_Brn25 Homo sapiens cDNA clone IMAGE:2466985 3' similar to TR:O15475 O15475 UNNAMED HERV-H PROTEIN contains LTR7.b1 LTR7 repetitive element ;
RC3-UT0052-210800-021-C05 UT0062 Homo sapiens cDNA
Homo sapiens chromosome 21 segment HS21C0D3
RC3-OT0091-170300-011-c12 OT0091 Homo sapiens cDNA
cn15c02.x1 Normal Human Trabecular Bone Cells Homo sapiens cDNA clone NHTBC_cn15c02 random
wd35g05.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2330170 3' similar to contains MER29.t2 MER29 repetitive element ;
wd35g05.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2330170 3' similar to contains MER29.t2 MER29 repetitive element ;
Human 90 kD heat shock protein gene, complete cds
Human beta-galactoside alpha2,6-sialyltransferase (SIAT1) mRNA, exon U
Homo sapiens PTS gene for 6-pyruvoyltetrahydropterin synthase, complete cds
601152657F1 NIHMGC1S Homo sapiens cDNA clone IMAGE:3508527 5'
repetitive element;contains MER19.t2 MER19 repetitive element ;
Homo sapiens chromosome 21 segment HS21C046
ht09g01.x1 NCI_CGAP_Kid13 Homo sapiens cDNA clone IMAGE:3146256 3' similar to contains MER29.b3 MER29 repetitive element;
OLFACTORY RECEPTOR-LIKE PROTEIN F5
Human HsLIM15 mRNA for HsLim15, complete cds
Homo sapiens envelope protein RIC-6 (env) gene, complete cds
Homo sapiens envelope protein RIC-6 (env) gene, complete cds
Human chromosome 22 immunoglobulin V(K)I gene, part, with 5' breakpoint between crptv
neighbouring non-amplified region
tm34a10.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2159994 3' similar to c
MER29 repetitive element ;
601511530F1 NIH_MGC_71 Homo sapiens cDNA clone IMAGE:3913087 5'
oh37c03.s1 NCI_CGAP_Kid5 Homo sapiens cDNA clone IMAGE:1459972 3' similar to ct
repetitive element ;
Homo sapiens PRO1181 mRNA, complete cds
Homo sapiens AT-binding transcription factor 1 (ATBF1), mRNA
Homo sapiens AT-binding transcription factor 1 (ATBF1), mRNA
Homo sapiens FLI-1 gene, partial
AV731500 HTF Homo sapiens cDNA clone HTFAKC07 5'
AV758634 BM Homo sapiens cDNA clone BMFBBH12 5'
z95a07.s1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA clone IMAGE:448
contains THR.tS THR repetitive element ;
Homo sapiens myeloid/lymphoid or mixed-lineage leukemia (trithorax (Drosophila) homolog
(MLLT4) mRNA
Homo sapiens myeloid/lymphoid or mixed-lineage leukemia (trithorax (Drosophila) homolog
(MLLT4) mRNA
601173631F1 NIH_MGC_17 Homo sapiens cDNA clone IMAGE:3529159 5
Human cell 12-lipoxygenase mRNA, complete cds
H.sapiens mRNA for myosin




zn66c08.r1 Stratagene HeLa cell s3 937216 Homo sapiens cDNA clone IMAGE:563150 5'
zn66c08.r1 Stratagene HeLa cell s3 937216 Homo sapiens cDNA clone IMAGE:563150 5'
Homo sapiens chromosome 11 open reading frame 9 (C11ORF9), mRNA
nw21g02.s1 NCI_CGAP_GCB0 Homo sapiens cDNA clone IMAGE:1241138 3' similar to
THR repetitive element ;
hw07c05.x1 NCI_CGAP_Lu24 Homo sapiens cDNA clone IMAGE:318221S 3' similar to
A971 F Heart Homo sapiens cDNA clone A971
hi£6a12.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2B79166 3' sim
SW:TR12_HUMAN Q14669 THYROID RECEPTOR INTERACTING PROTEIN 12 ;
Homo sapiens Grb2-associated binder 2 (KIAA0571), mRNA
Homo sapiens Grb2-associated binder 2 (KIAA0571), mRNA
Homo sapiens mRNA for KIAA0895 protein, partial cds
TCBAP2E4328 Pediatric pre-B cell acute lymphoblastic leukemia Baylor-HGSC project=
cDNA clone TCBAP4328
TCBAP2E4328 Pediatric pre-B cell acute lymphoblastic leukemia Baylor-HGSC project=
cDNA clone TCBAP4328
yq19a12.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:274079 5
UI-H-BI2-agc-b-10-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:27236
QV0-BT0701-210400-199-b04 BT0701 Homo sapiens cDNA
CM2-MT0125-280700-297-G02 MT0125 Homo sapiens cDNA
CM2-MT0125-280700-297-G02 MT0125 Homo sapiens cDNA
H.sapiens PROS-27 mRNA
QV0-BT0701-210400-199-b04 BT0701 Homo sapiens cDNA
Homo sapiens chromosome 21 segment HS21C010
fmfc16 Regional genomic DNA specific cDNA library Homo sapiens cDNA clone CR12-1
fmfc16 Regional genomic DNA specific cDNA library Homo sapiens cDNA clone CR12-1
SP:A44282 A44282 RETROVIRUS-RELATED POL POLYPROTEIN - HUMAN ;
Homo sapiens hypothetical protein (LOC51233), mRNA
ht09g01.x1 NCI_CGAP_Kid13 Homo sapiens cDNA clone IMAGE:3146256 3' similar to
MER29 repetitive element ;
ht09g01.x1 NCI_CGAP_Kid13 Homo sapiens cDNA clone IMAGE:3146256 3' similar to
MER29 repetitive element;
AV650422 GLC Homo sapiens cDNA clone GLCCEF06 3'
qd61c08.x1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:1733968 3' similar to contains PTR7.t3 PTR7 repetitive element;
hu53a08.x1 NCI_CGAP_Brn41 Homo sapiens cDNA clone IMAGE:3173750 3' similar to contains element MER40 repetitive element ;
hu53a08.x1 NCI_CGAP_Brn41 Homo sapiens cDNA clone IMAGE:3173750 3' similar to contains element MER40 repetitive element ;
Human ribosomal protein L23a mRNA, complete cds
FB1G5 Fetal brain, Stratagene Homo sapiens cDNA clone FB1G5 3'end similar to LINE-1
Homo sapiens Ras-like GTP-binding protein (RAB27A) gene, exons 1b and 2
Homo sapiens Ras-like GTP-binding protein (RAB27A) gene, exons 1b and 2
602022313F1 NCI_CGAP_Brn67 Homo sapiens cDNA clone IMAGE:4157566 5'
Homo sapiens pyruvate dehydrogenase kinase, isoenzyme 3 (PDK3) mRNA
Homo sapiens Sp4 transcription factor (SP4) mRNA
Homo sapiens Sp4 transcription factor (SP4) mRNA
yg40e01.r1 Soares infant brain 1NIB Homo sapiens cDNA clone IMAGE:34732 5' similar to SP:ED38_MOUSE P28B56 BRAIN PROTEIN DN38 ;
Homo sapiens vacuolar sorting protein 35 (VPS35) mRNA, complete cds
Homo sapiens 8q22.1 region and MTG8 (CBFA2T1) gene, partial cds
wr87h01.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2494705 3'
Homo sapiens calcium channel, voltage-dependent, alpha 1E subunit (CACNA1E), mRNA
DKFZp761D1015_r1 761 (synonym: hamy2) Homo sapiens cDNA clone DKFZp761D1015 5'
wb99b04.x1 NCI_CGAP_Pr28 Homo sapiens cDNA clone IMAGE:2313775 3'
Homo sapiens cadherin EGF LAG seven-pass G-type receptor 1 (CELSR1), mRNA
qh23g01.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:1845552 3'
Homo sapiens RAB36 (RAB36) mRNA, complete cds
hw14g05.x1 NCI_CGAP_Lu24 Homo sapiens cDNA clone IMAGE:3182938 3' similar to
P22359 OXYSTEROL-BINDING PROTEIN. ;
Homo sapiens tissue-type bone marrow zinc finger protein 4 mRNA, complete cds
Homo sapiens tumor necrosis factor (ligand) superfamily, member 10 (TNFSF10) mRNA
Homo sapiens vesicle transport-related protein (KIAA0917), mRNA




TCBAP1E2795 Pediatric pre-B cell acute lymphoblastic leukemia Baylor-HGSC project=
cDNA clone TCBAP2795
Homo sapiens Misshapen/NIK-related kinase (MINK), mRNA
Homo sapiens Misshapen/NIK-related kinase (MINK), mRNA
RC1-CT0249-030300-026-h12 CT0249 Homo sapiens cDNA
RC1-BN0039-110300-012-b01 BN0039 Homo sapiens cDNA




zw53d02.r1 Soares_total_fetus_Nb2HF8_9w Homo sapiens cDNA clone IMAGE:77376
contains THR.t3 THR repetitive element ;
zw53d02.r1 Soares_total_fetus_Nb2HF8_9w Homo sapiens cDNA clone IMAGE:77376
contains THR.t3 THR repetitive element ;
Homo sapiens transcription factor IGHM enhancer 3, JM11 protein, JM4 protein, JM5 pn
JM10 protein, A4 differentiation-dependent protein, triple LIM domain protein 6, and syna
complete cds; and L-type calcium channel a>
aa01c09.s1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:811984 3'
EST379147 MAGE resequences, MAGJ Homo sapiens cDNA


Top Hit Database: EST_HUMAN












BE465325.1, D25303.1, AW853132.1, AA434554.1, AA434554.1






Most Similar (Top) Hit BLAST E: E-44














30557, 30767, 35282




























































Probe SEQ ID NO: 2570, 9178



WO 01/57273 PCT/US01/00664







Top Hit Descriptor 


7d81g01.x1 Lupski_dorsal_root_ganglion Homo sapiens cDNA clone IMAGE:3279408 3'
naa38f07.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:3258767 3' similar to TR:O75202 O75202 HOMOLOG OF RAT KIDNEY-SPECIFIC ;
602021164F1 NCI_CGAP_Brn67 Homo sapiens cDNA clone IMAGE:4156670 5'
ov85g05.x1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:1644160 3'
Homo sapiens mRNA for KIAA0803 protein, partial cds
601120116F1 NIH_MGC_20 Homo sapiens cDNA clone IMAGE:2967027 5'
601120116F1 NIH_MGC_20 Homo sapiens cDNA clone IMAGE:2967027 5'
Homo sapiens SMA3 (SMA3), mRNA
Homo sapiens testis-specific Testis Transcript Y 1 (TTY1) mRNA, partial cds
Homo sapiens mRNA for KIAA0405 protein, partial cds


43c5 Human retina cDNA randomly primed sublibrary Homo sapiens cDNA
Homo sapiens DKFZP586G1219 protein (DKFZP586G1219), mRNA
Homo sapiens chromosome 21 segment HS21C0S7
Homo sapiens chromosome 21 segment HS21C010
yv44g03.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:245620 5'
Homo sapiens DSCR5b mRNA, complete cds
Homo sapiens DSCR5b mRNA, complete cds
Homo sapiens PRO1851 mRNA, complete cds
Homo sapiens hect domain and RLD 2 (HERC2), mRNA
Homo sapiens hect domain and RLD 2 (HERC2), mRNA
Homo sapiens F-box protein FBL4 (FBL4) mRNA, complete cds
Homo sapiens cytochrome P450, 51 (lanosterol 14-alpha-demethylase) (CYP51) mRNA
Homo sapiens discs, large (Drosophila) homolog 2 (chapsyn-110) (DLG2), mRNA
Homo sapiens discs, large (Drosophila) homolog 2 (chapsyn-110) (DLG2), mRNA
Homo sapiens phospholipid scramblase 1 gene, complete cds
Homo sapiens chromosome 21 segment HS21C010
Human infant brain unknown product mRNA, complete cds
sea;1575 b4HB3MA Cot8-HAP-Ft Homo sapiens cDNA clone b4HB3MA-COT8-HAP-Ft61 5' similar to similar to Chinese Hamster DHFR-coamplified protein mRNA
Homo sapiens DNA-binding protein (LOC56242), mRNA
601237702F1 NIH_MGC_44 Homo sapiens cDNA clone IMAGE:3609552 5'


EST_HUMAN


AB020710.1, BE277861.1, BE277861.1, AF000990.1, AB007866.2, BE077198.1, AL163210.2, ti 0045.1


Most Similar (Top) Hit BLAST E: .0E-55


































































Expression 












28372, 29410, 30798


























































Probe SEQ ID NO: 4843, 8013







Top Hit Descriptor 


xr05d10.x1 NCI_CGAP_Brn53 Homo sapiens cDNA clone IMAGE:27592b1 3' similar to
INTERFERON-GAMMA RECEPTOR BETA CHAIN PRECURSOR (HUMAN);
zv51b12.r1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:757151 5'
Homo sapiens EphA4 (EPHA4) mRNA
Homo sapiens EphA4 (EPHA4) mRNA
Homo sapiens glutamate receptor, ionotrophic, AMPA 4 (GRIA4) mRNA
Homo sapiens aconitase 2, mitochondrial (ACO2), mRNA
Homo sapiens mRNA for KIAA0898 protein, partial cds
Homo sapiens mRNA for KIAA0950 protein, partial cds
Homo sapiens mRNA for KIAA0930 protein, partial cds
Homo sapiens KIAA0716 gene product (KIAA0716), mRNA
Homo sapiens mRNA for KIAA0837 protein, partial cds
Homo sapiens mRNA for KIAA0837 protein, partial cds
Homo sapiens hypothetical protein FLJ20371 (FLJ20371), mRNA
Homo sapiens ninein (LOC51199), mRNA
Homo sapiens DNA, DLEC1 to ORCTL4 gene region, section 1/2 (DLEC1, ORCTL3, OF
complete cds)
Homo sapiens hypothetical protein (LOC51318), mRNA
Homo sapiens chromosome 21 segment HS21C084
qb77c09.x1 Soares_fetal_heart_NbHH19W Homo sapiens cDNA clone IMAGE:1706128 3' similar to SW:K1CJ_MOUSE P02535 KERATIN, TYPE I CYTOSKELETAL 10 ;
Homo sapiens a disintegrin and metalloproteinase domain 22 (ADAM22), mRNA
Homo sapiens a disintegrin and metalloproteinase domain 22 (ADAM22), mRNA
O.cuniculus mRNA for elongation factor 1 alpha
IL3-HT0619-060700-198-D10 HT0619 Homo sapiens cDNA
IL5-HT0702-160600-103-d05 HT0702 Homo sapiens cDNA
DKFZp434N0323_r1 434 (synonym: htes3) Homo sapiens cDNA clone DKFZp434N0323 5'
DKFZp434N0323_r1 434 (synonym: htes3) Homo sapiens cDNA clone DKFZp434N0323 5'
Human mRNA from chromosome 15 gene with homology to MHC-HLA-SB-1 intron A
Human mRNA from chromosome 15 gene with homology to MHC-HLA-SB-1 intron A
Homo sapiens hormonally upregulated neu tumor-associated kinase (HUNK), mRNA
Homo sapiens mRNA for KIAA1081 protein, partial cds
Homo sapiens similar to SET translocation (myeloid leukemia-associated) (H. sapiens) (LOC63102), mRNA
Homo sapiens myeloid/lymphoid or mixed-lineage leukemia (trithorax (Drosophila) homolog); translocated to, 4 (MLLT4) mRNA
ETS-RELATED PROTEIN 71 (ETS TRANSLOCATION VARIANT 2)
Human transcription factor NFATx3 mRNA, complete cds
TCBAP1E4051 Pediatric pre-B cell acute lymphoblastic leukemia Baylor-HGSC project=TCBA Homo sapiens cDNA clone TCBAP4051
Homo sapiens Calsenilin, presenilin-binding protein, EF hand transcription factor (CSEN) mRNA
Homo sapiens SNARE protein kinase SNAK mRNA, complete cds
Homo sapiens SNARE protein kinase SNAK mRNA, complete cds
Homo sapiens dynein, axonemal, light polypeptide 4 (DNAL4), mRNA
aj62b09.s1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:1394873 3'
os91g03.s1 NCI_CGAP_GC3 Homo sapiens cDNA clone IMAGE:1612756 3' similar to gb:M16342 HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEINS C1/C2 (HUMAN);
Homo sapiens chromosome 21 segment HS21C046
601142409F1 NIH_MGC_14 Homo sapiens cDNA clone IMAGE:3506186 5'
Homo sapiens similar to sema domain, immunoglobulin domain (Ig), short basic domain, secreted (semaphorin) 3A (H. sapiens) (LOC53232), mRNA
Homo sapiens hormonally upregulated neu tumor-associated kinase (HUNK), mRNA
Homo sapiens hormonally upregulated neu tumor-associated kinase (HUNK), mRNA
Homo sapiens complement component 8, beta polypeptide (C8B) mRNA
DKFZp434E246_r1 434 (synonym: htes3) Homo sapiens cDNA clone DKFZp434E246 5'
H.sapiens CLN3 gene, complete CDS
H.sapiens CLN3 gene, complete CDS
Homo sapiens plastin 3 (T isoform) (PLS3), mRNA
Homo sapiens plastin 3 (T isoform) (PLS3), mRNA
Homo sapiens actin related protein 2/3 complex, subunit 1A (41 kD) (ARPC1A), mRNA
Homo sapiens KIAA0433 protein (KIAA0433), mRNA
Homo sapiens KIAA0433 protein (KIAA0433), mRNA
Human N-ethylmaleimide-sensitive factor mRNA, partial cds
Homo sapiens chromosome 21 segment HS21C085
Human GT24 (GT24) mRNA, partial cds
Homo sapiens solute carrier family 24 (sodium/potassium/calcium exchanger), member 2 (SLC24A2), mRNA
Homo sapiens partial mRNA for PEX5 related protein
Homo sapiens mRNA for KIAA1333 protein, partial cds
Homo sapiens CaBP5 (CABP5) gene, exon 5
Homo sapiens integrin, alpha 3 (antigen CD49C, alpha 3 subunit of VLA-3 receptor) (ITGA3), mRNA
Homo sapiens cell adhesion molecule with homology to L1CAM (close homologue of L1) (CHL1), mRNA
Human MAGE-7 antigen (MAGE7) pseudogene, complete cds
Homo sapiens mRNA for kinesin-like protein, complete cds
hr81d09.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:3134897 3' similar to TR:O54778 O54778 SOLUTE CARRIER FAMILY 22 -LIKE 2 PROTEIN ;
hr81d09.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:3134897 3' similar to TR:O54778 O54778 SOLUTE CARRIER FAMILY 22 -LIKE 2 PROTEIN ;
Homo sapiens chromosome 21 segment HS21C046
Homo sapiens chromosome 21 segment HS21C046
Homo sapiens chromosome 21 segment HS21C046
Homo sapiens chromosome 21 segment HS21C046
7e36f08.x1 NCI_CGAP_Lu24 Homo sapiens cDNA clone IMAGE:3284583 3'
7e36f08.x1 NCI_CGAP_Lu24 Homo sapiens cDNA clone IMAGE:3284583 3'
RC1-HT0598-120400-022-b08 HT0598 Homo sapiens cDNA
qg56c08.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:1843022 3' similar to gb:J04131 GAMMA-GLUTAMYLTRANSPEPTIDASE 1 PRECURSOR (HUMAN);contains Alu repetitive element;
Homo sapiens RaP2 interacting protein 8 (RPIP8), mRNA
Homo sapiens RaP2 interacting protein 8 (RPIP8), mRNA
305.y3 NIH_MGC_10 Homo sapiens cDNA clone IMAGE:289
HYPOTHETICAL 35.6 KD PROTEIN. ;
Homo sapiens similar to laminin receptor 1 (67kD, ribosomal protein
Homo sapiens similar to laminin receptor 1 (67kD, ribosomal protein
8985 HEMBA1 Homo sapiens cDNA clone HEMBA1004795
3985 HEMBA1 Homo sapiens cDNA clone HEMBA1004795
Homo sapiens myosin, heavy polypeptide 4, skeletal muscle (MYH4)
Homo sapiens amyloid beta (A4) precursor protein (protease nexin-II
Homo sapiens mRNA for T-box transcription factor (TBX20 gene), p
Homo sapiens mRNA for T-box transcription factor (TBX20 gene), p
Homo sapiens ALR-like protein mRNA, partial cds
Homo sapiens ALR-like protein mRNA, partial cds
Homo sapiens Kruppel-like factor 7 (ubiquitous) (KLF7), mRNA
Homo sapiens protein phosphatase 2A BR gamma subunit gene, ex
Homo sapiens protein phosphatase 2A BR gamma subunit gene, ex
Homo sapiens similar to SALL1 (sal (Drosophila)-like (LOC57167), r
Homo sapiens chromosome 8 open reading frame 2 (C8ORF2), mR
Homo sapiens mRNA for KIAA0903 protein, partial cds
Homo sapiens mRNA for KIAA0903 protein, partial cds
Homo sapiens soluble interleukin 1 receptor accessory protein (IL1R
complete cds, alternatively spliced
Homo sapiens mRNA for KIAA0633 protein, partial cds
Homo sapiens KIAA0623 gene product (KIAA0623), mRNA
Homo sapiens cytochrome P450, 51 (lanosterol 14-alpha-demethyla
Human retina-derived POU-domain factor-1 mRNA, complete cds
Homo sapiens glutamate receptor, ionotropic, N-methyl D-aspartate
Homo sapiens solute carrier family 1 (high affinity aspartate/glutamate transporter), member 6 (SLC1A6), mRNA
Homo sapiens brefeldin A-inhibited guanine nucleotide-exchange protein 2 (BIG2), mRNA
Homo sapiens SNCA isoform (SNCA) gene, complete cds, alternatively spliced
Homo sapiens CGI-15 protein (LOC51006), mRNA
Homo sapiens CGI-15 protein (LOC51006), mRNA
Human branched chain alpha-keto acid dehydrogenase mRNA, 3' end
HUM000S381 Liver HepG2 cell line. Homo sapiens cDNA clone s381 3'
Rattus norvegicus brain specific cortactin-binding protein CBP90 mRNA, partial cds
Homo sapiens makorin, ring finger protein, 1 (MKRN1), mRNA
CM-BT043-090299-075 BT043 Homo sapiens cDNA
AU143539 Y79AA1 Homo sapiens cDNA clone Y79AA1002087 5'
AU143539 Y79AA1 Homo sapiens cDNA clone Y79AA1002087 5'
Homo sapiens chromosome 22 open reading frame 5 (C22ORF5), mRNA
Homo sapiens chromosome 22 open reading frame 5 (C22ORF5), mRNA
au45f09.x1 Schneider fetal brain 00004 Homo sapiens cDNA clone IMAGE:2518121 3' similar to SW:ASPG_FLAME Q47898 N4-(BETA-N-ACETYLGLUCOSAMINYL)-L-ASPARAGINASE PRECURSOR ;
AV649878 GLC Homo sapiens cDNA clone GLCBYF08 3'
AV649878 GLC Homo sapiens cDNA clone GLCBYF08 3'
Homo sapiens lysophosphatidic acid acyltransferase-delta (LPAAT-delta) mRNA, complete cds
Homo sapiens lysophosphatidic acid acyltransferase-delta (LPAAT-delta) mRNA, complete cds
Homo sapiens chromosome 21 segment HS21C084
EST01579 Hippocampus, Stratagene (cat. #936205) Homo sapiens cDNA clone HHCMC60 similar to Retrovirus-related gag polyprotein
EST01579 Hippocampus, Stratagene (cat. #936205) Homo sapiens cDNA clone HHCMC60 similar to Retrovirus-related gag polyprotein
Homo sapiens solute carrier family 4, anion exchanger, member 3 (SLC4A3), mRNA
Homo sapiens solute carrier family 4, anion exchanger, member 3 (SLC4A3), mRNA
Homo sapiens ubiquitin-conjugating BIR-domain enzyme APOLLON mRNA, complete cds
bb49b02.x1 NIH_MGC_17 Homo sapiens cDNA clone IMAGE:3009963 3' similar to TR:O29061 O29061 2-DEOXY-D-GLUCONATE 3-DEHYDROGENASE contains Alu repetitive element;
Homo sapiens beta-ureidopropionase (BUP1) gene, exon 6
Homo sapiens chromosome 21 segment HS21C083
Homo sapiens mRNA for KIAA1278 protein, partial cds
Homo sapiens mRNA for KIAA1278 protein, partial cds
Homo sapiens cyclin-D binding Myb-like protein mRNA, complete cds
Human Ku (p70/p80) subunit mRNA, complete cds
Homo sapiens chromosome 21 segment HS21C085
Homo sapiens epididymal secretory protein (19.5kD) (HE1), mRNA
Homo sapiens gamma-aminobutyric acid (GABA) B receptor, 1 (GABBR1), transcript variant
Homo sapiens gamma-aminobutyric acid (GABA) B receptor, 1 (GABBR1), transcript variant
Human L-type calcium channel beta-1 subunit (CACNLB1) gene, exons 10 and 11
Human L-type calcium channel beta-1 subunit (CACNLB1) gene, exons 10 and 11
Homo sapiens ankyrin-like with transmembrane domains 1 (ANKTM1), mRNA
AV733454 cdA Homo sapiens cDNA clone cdABA08 5'
AV733454 cdA Homo sapiens cDNA clone cdABA08 5'
Homo sapiens TNF-inducible protein CG12-1 (CG12-1), mRNA
Homo sapiens hypothetical protein (DJ1042K10.2), mRNA
Homo sapiens hypothetical protein (DJ1042K10.2), mRNA
qf43a11.x1 Soares_testis_NHT Homo sapiens cDNA clone IMAGE:1752764 3' similar to TR:Q13458
Homo sapiens HOXD13 gene for homeobox transcription factor, complete cds
601140485F1 NIH_MGC_9 Homo sapiens cDNA clone IMAGE:3048820 5'
Human glucose transporter (GLUT4) gene, complete cds
Homo sapiens gamma-aminobutyric acid (GABA) A receptor, alpha 2 (GABRA2), mRNA
Homo sapiens DNA for prostacyclin synthase, exon 8
PHOSPHOLIPASE A2-GAMMA ;
iaD5g05.y1 Human Pancreatic islets Homo sapiens cDNA 5' similar to TR:O75457 O75457 CYTOSOLIC PHOSPHOLIPASE A2-GAMMA. ;
Homo sapiens COX11 (yeast) homolog, cytochrome c oxidase assembly protein (COX11), mRNA
Homo sapiens mRNA for KIAA0910 protein, partial cds
Homo sapiens pericentrin (PCNT) mRNA
Homo sapiens T-cell lymphoma invasion and metastasis 1 (TIAM1) mRNA
Homo sapiens hypothetical protein DKFZp761P1010 (DKFZp761P1010), mRNA
mRNA for NK receptor (183 ActI)
-150200-012-d03 BT0642 Homo sapiens cDNA
genous retrovirus-K, LTR U5 and gag gene
NCI_CGAP_GC3 Homo sapiens cDNA clone IMAGE:2244612 3'
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1. A spatially-addressable set of single exon nucleic acid 
probes for measuring gene expression in a sample derived
from human adult liver comprising a plurality of single exon
nucleic acid probes, said probes comprising any one of the
nucleotide sequences set out in SEQ ID NOs: 1 - 13,109 or a
complementary sequence, or a portion of such a sequence.

2. A spatially-addressable set of single exon nucleic acid
probes as claimed in claim 1 wherein each of said plurality
of probes is separately and addressably amplifiable.

3. A spatially-addressable set of single exon nucleic acid
probes as claimed in claim 1 wherein each of said plurality
of probes is separately and addressably isolatable from
said plurality.

4. A spatially-addressable set of single exon nucleic acid
probes as claimed in any of claims 1 to 3 wherein said
probes comprise any one of the nucleotide sequences set out
in SEQ ID NOs.: 13,110 - 25,995.

5. A spatially-addressable set of single exon nucleic acid
probes as claimed in any of claims 1 to 4, wherein each of
said plurality of probes is amplifiable using at least one
common primer.

6. A spatially-addressable set of single exon nucleic acid
probes as claimed in any of claims 1 to 5 wherein the set
comprises between 50 and 20,000 single exon nucleic acid
probes.

7. A spatially-addressable set of single exon nucleic acid
probes as claimed in any of claims 1 to 6, wherein the
average length of the single exon nucleic acid probes is
between 200 and 500 bp.

8. A spatially-addressable set of single exon nucleic acid
probes as claimed in any of claims 1 to 7, wherein at least
50% of said single exon nucleic acid probes lack
prokaryotic and bacteriophage vector sequence.

9. A spatially-addressable set of single exon nucleic acid
probes as claimed in any of claims 1 to 8, wherein at least
50% of said single exon nucleic acid probes lack
homopolymeric stretches of A or T.

10. A spatially-addressable set of single exon nucleic acid 
15 probes as claimed in any of claims 1-9 characterised in 

that said set of probes is addressably disposed upon a 
substrate . 

11. A spatially-addressable set of single exon nucleic acid 
probes as claimed in claim 10 wherein said substrate is
selected from glass, amorphous silicon, crystalline silicon
and plastic. 

12. A microarray comprising a spatially-addressable set of
single exon nucleic acid probes as claimed in any of claims
1 to 11.

13. A single exon nucleic acid probe for measuring human 
gene expression in a sample derived from human adult liver
comprising a nucleotide sequence as set out in any of SEQ
ID NOs: 1 - 13,109 or a complementary sequence or a
fragment thereof wherein said probe hybridizes at high 
stringency to a nucleic acid molecule expressed in the 
human adult liver. 



14. A single exon nucleic acid probe as claimed in claim 13 
comprising a nucleotide sequence as set out in any of SEQ
ID NOs: 13,110 - 25,995 or a complementary sequence or a
fragment thereof. 

15. A single exon nucleic acid probe for measuring human 
gene expression in a sample derived from human adult liver 
which is a nucleic acid molecule having a sequence encoding 
a peptide comprising a peptide sequence as set out in any
of SEQ ID NOs: 25,996 - 38,578, or a complementary
sequence or a fragment thereof wherein said probe 
hybridizes at high stringency to a nucleic acid expressed 
in the human adult liver. 

16. A single exon nucleic acid probe as claimed in any one
of claims 13 to 15 wherein said single exon nucleic acid 
probe comprises between 15 and 25 contiguous nucleotides of 
said SEQ ID NO. 

17. A single exon nucleic acid probe as claimed in any one
of claims 13 to 15, wherein said probe is between 3 and 25 kb
in length. 

18. A single exon nucleic acid probe as claimed in any one 
of claims 13 - 17, wherein said probe is DNA, RNA or PNA.

19. A single exon nucleic acid probe as claimed in any one 
of claims 13 - 18, wherein said probe is detectably 
labeled. 

20. A single exon nucleic acid probe as claimed in any one 
of claims 13 - 19, wherein said probe lacks prokaryotic and 
bacteriophage vector sequence. 

21. A single exon nucleic acid probe as claimed in any one
of claims 13 - 20, wherein said probe lacks homopolymeric
stretches of A or T. 



22. A method of measuring gene expression in a sample 
derived from human adult liver, comprising:

contacting the microarray of claim 12 with a first
collection of detectably labeled nucleic acids,
said first collection of nucleic acids derived
from mRNA of human adult liver; and then
measuring the label detectably bound to each probe of
said microarray.

23. A method of identifying exons in a eukaryotic genome, 
comprising:

algorithmically predicting at least one exon from
genomic sequence of said eukaryote; and then
detecting specific hybridization of detectably labeled
nucleic acids to a single exon probe,
wherein said detectably labeled nucleic acids are derived
from mRNA from the adult liver of said eukaryote, said
probe is a single exon probe having a fragment identical in
sequence to, or complementary in sequence to, said
predicted exon, said probe is included within a microarray
according to claim 12, and said fragment is selectively
hybridizable at high stringency.

24. A method of assigning exons to a single gene, 
comprising: 

identifying a plurality of exons from genomic
sequence according to the method of claim 23; and then
measuring the expression of each of said exons in a
plurality of tissues and/or cell types using
hybridization to single exon microarrays having a
probe with said exon,
wherein a common pattern of expression of said exons in
said plurality of tissues and/or cell types indicates that 
the exons should be assigned to a single gene. 

25. A nucleic acid sequence as set out in any of SEQ ID
NOs: 1 - 25,995 which encodes a peptide. 

26. A peptide encoded by a sequence as set out in any of 
SEQ ID NOs: 1 - 25,995.

27. A peptide comprising a sequence as set out in any of 
SEQ ID NOs: 25,996 - 38,578.